Thursday, May 13, 2010

We know that David Cameron doesn't like Twitter - what does it think of him?

Over the course of the General Election I recorded 1000 random tweets every hour and sent them to tweetsentiments.com for sentiment analysis.

Tweetsentiment have a service which gives one of three values to each tweet. '0' means a negative sentiment (unhappy tweet), '2' a neutral or undetermined sentiment and '4' positive (happy tweet). Similar technology is used to detect levels of customer satisfaction at call centres by monitoring phone calls.

Obviously it's difficult for a machine to detect the emotional meaning of a sentence, especially with the strange conventions used on Twitter. Despite this Tweetsentiment seems to be fairly reliable - tweets always which express happy emotions tend to be rated as such, and vice verse. More accurately, if Tweetsentiment does make a classification it tends to get it right. Sometimes an obviously positive / negative tweet gets a '2', but that shouldn't affect things here.

My hypothesis was that the Twitterati would be less happy if there was a Conservative victory. Of course I can’t prove that Twitter has a bias to the left, but I would presume that young, techy, early adopters are more likely to be left leaning. The reaction to the Jan Moir Stephen Gately article perhaps supports this.

David Cameron famously noted that Twitter is for twats, I wondered if Twitter would reciprocate...



The graph indicates that usually Twitter is just slightly positive, with a mood value of 2.1 on average. As predicted, as a conservative victory becomes apparent on Thursday evening there is a decline in mood which lasts until Saturday lunchtime. Then everyone cheers up, presumably goes down the pub, and is pretty chirpy for Sunday lunch. Sentiment only returns to average for the beginning of work on Monday morning.

In short, it does look like the election result was a disappointment to Twitter.

Obviously we need to know what normal Twitter behaviour is over the course of the week to draw very much information from the graph, and this is something that I’m going to try and produce a graph for soon.

It does look as though the size of negative reaction to a once-a-decade change in government is about the same magnitude as the positive mood elicited by the prospect of Sunday lunch - which I think is fairly consistent with the vicissitudes of Twitter as I experienced them.

I used Twitter’s API to gather the data, and frankly, it’s not particularly great, particularly if you want to get Tweets from the past. I was surprised to discover that any Tweets more than about 24 hours old simply disappear from the search function on Twitter.com – in effect they only exist in public for a day. For this reason the hourly sample size wasn’t always exactly 1000, but it was on average.

I’ll post again when I have some more data on normal behaviour. I’m also curious to find out if different countries have different average happiness levels on Twitter, but I think finding a Tweetsentiment-style service for other languages might prove difficult.

Tuesday, May 4, 2010

Michael Jackson is more important than Jesus. Fact.

My last post used Wikipedia’s list of dates of births and deaths to build a timeline showing the lifespans of people who have pages on Wikipedia. There are a lot of people with Wikipedia pages, so I limited it to only include dead people.

That still leaves you with a lot of people to fit on one timeline, so I wanted to prioritise ‘important’ or ‘interesting’ people at the top and show only the most 'important' 1000. Some have been confused by my method for doing this, and others have questioned its validity, so this post will address both issues. I’m also going to suggest an improvement. It turns out that whatever I do Michael Jackson is more important than Jesus. I'm just the messenger.

Explaining the method
To get a measure of ‘importance’ I used work done by Stephan Dolan. He has developed a system for ranking Wikipedia pages which is very similar to the PageRank system which Google uses to prioritise its search results.

Wikipedia’s pages link to one another, and Stephan Dolan’s algorithm gives a measure of well linked to all the other Wikipedia pages a particular page is. If we want to know how well linked in the page about Charles Darwin is the algorithm examines every other page in Wikipedia and works out how many links you would have to follow to get from the page it is examining to the Charles Darwin page using the shortest route.

For example, to get from Aldous Huxley to Charles Darwin takes two links, one from Aldous to Thomas Henry Huxley (Aldous’s father) and then another to Darwin (TH Huxley famously defended evolution as a theory). Dolan’s method calculates the average number of clicks from every page in Wikipedia to the Charles Darwin page, and then takes an average value. To get to Charles Darwin takes an average 3.88 clicks from other Wikipedia pages.

Equivalently, Google shows pages that have many links pointing to them nearer the top in its search results.

This method works OK, but it could be better. For example Mircea Eliade ranks as the fifth most important dead person dead person on Wikipedia, taking on average 3.78 clicks to find him. But Mircea Eliade is a Romanian historian of religion - hardly a household name. We can take this as a positive statement, perhaps Mircea Eliade is a figure of hither to unrecognised importance and influence. On the other hand it seems impossible that he can be more ‘important’ than Darwin.

Testing the validity of the Dolan Index
I decided it would be interesting to compare what I’m going to call the Dolan index (the average number of clicks as described above) with two other metrics that could be construed as measuring the importance of a person. Before we do that, here is a Graph of what the Dolan index of dead people on Wikipedia looks like.





The bottom axis shows the rank order of pages, from Pope John Paul II, who is has the 275th highest Dolan index on Wikipedia, to Zi Pitcher, who comes 430900th in terms of Dolan index. It makes a very tidy log plot.

As I mentioned previously, the Dolan index is very similar to a Google PageRank, so lets compare them.





The x axis is the same as the first graph, Wikipedia pages from highest to lowest Dolan index. A well linked page has a low Dolan index, but a High PageRank, so I used the reciprocal of PageRank for the y axis. I've also added a log best fit line.

Comparing with PageRank seems to indicate there is a reasonable correlation between Dolan index and PageRank, which is indicated by the fact the first and second graphs have a similar shape.

PageRank is only given in integer values between 1-10 (realistically, all Wikipedia pages have a PageRank between 3-7), so I’ve smoothed the curve using a moving average.

This seems to lend some weight to the Dolan Index as a measure.

I’ve also made a comparison between the Dolan index the number of results returned when searching for a person’s name (without quotes) in Google search. It should be noted that this number seems to be quite unstable - a search will give a slightly different number of results from one day to the next. I’ve used a log scale because of the range of results.






There is barely any correlation here, except a very low values of Dolan index. Despite this, it’s still possible for the number of Google results to be useful, as becomes in apparent when trying to improve my measure of ‘importance’.

A suggestion for improvement
The problem with all the measures seems to be the noise inherent in the system. While Dolan Index, PageRank and number of Google results all provide a rough guide to ‘importance’ or ‘interest’ overall, each of them frequently gives unlikely results. How about using a mixture of all three? Here is a table comparing the top 25 dead people by Dolan index and using a hybrid measure of importance constructed from all three metrics.

Dolan index Hybrid measure
Pope John Paul II Michael Jackson
Michael Jackson Jesus
John F. Kennedy Ronald Reagan
Gerald Ford Jimi Hendrix
Mircea Eliade Abraham Lincoln
Peter Jennings Adolf Hitler
John Lennon Albert Einstein
Adolf Hitler William Shakespeare
Harry S. Truman Charles Darwin
Rold Reagan Oscar Wilde
J. R. R. Tolkien Woodrow Wilson
James Brown Isaac Newton
Anthony Burgess Elvis Presley
Elvis Presley Walt Disney
Christopher Reeve John Lennon
Susan Oliver George Washington
Franklin D. Roosevelt John F. Kennedy
Winston Churchill Timur
Ernest Hemingway Martin Luther
Theodore Roosevelt Voltaire


To get the hybrid measure I just messed around until things felt right. Here is the formula I came up with:

Hybrid measure = ((1/Dolan index)x 20) + (PageRank x0.6) + (log(number of results)x 0.6)

For some reason additive formulas give better results than multiplicative ones.

Using the hybrid measure seems to have removed the surprises (like Peter Jennings) although you might still argue that Oscar Wilde or Jimi Hendrix are much too high. Michael Jackson comes out as bigger than Jesus, but then he is an exceptionally famous person, and he died much more recently than Jesus. Timur (AKA Tamerlane) is a bit of a curiosity.

I considered ignoring Number of Google results because its such a noisy dataset, however it’s the only reason that Jesus appears at all in this list, he gets a very low ranking (4.01) from the Dolan Index. Any formula which brings Jesus out on top (which I think you could make a reasonable case for his deserving, at least over Michael Jackson!), gives all kinds of strage results elsewhere.

I am a bit suspicious of "number of google results" metric. In addition to volatility Number of results fails to take into account that occurrences of words such as "Newtonian" should probably count towards Newton's ranking, but that people called David Mitchell will benefit artificially from the fact that at least two famous people share the name.

Any further investigation would have to consider what made a person ‘important’ – would it simply be how prominent they are in the minds of people (Michael Jackson and Jimi Hendrix) or would it reflect how influential they were (Charles Darwin for example, or the notably absent Karl Marx)?

I love the idea that the web reflects the collective conciousness, a kind of super-brain aggregation of human knowlege.

Just this week the idea of reflecting the whole of reality in one enormous computer system was promoted by Dirk Helbing, although my formula doesn't rate him as very important, so I'm unsure as to how seriously to take this.