Jean Véronis


Wednesday, May 11, 2005

Google: TrustRank, much ado about nothing?

The mighty Godgle has stirred in Mountain View, sending immediate ripples of excitement through the mortal Internauts down here in the agorasphere. What’s going on? What are these murmurings from the God? Is Godgle’s mood about to change? Will he get angry and send all spammers and search-engine optimisers straight to Hell? Or will he pour out the Horn of Plenty upon us, delivering streams of News each day that are more beautiful than any we have seen before, all garlanded in flowers? Is Google Juice about to become a delicious nectar that will bestow informational immortality upon us poor Internauts? The danger is great; if we slip up and offend this God, we may be forced to spend eternity in the wastelands of Cyberia, condemned like Tantalus to die of intellectual hunger and thirst with a planetary ocean of information just a mouse-click out of reach ...

[Image: the god Godgle]

Let’s examine the facts. On 16 March 2005 Google registered the trademark TrustRank (see the website of the United States Patent and Trademark Office or USPTO). This was picked up by some clued-in Internauts (I think it was Gary Price from SearchEngineWatch [a, b] who spotted it first, but I might be wrong). Immediately the agorasphere began to bubble with agitation [see Slashdot a and b].

Our first clue is that the USPTO site also reveals that Google has filed a patent entitled "Systems and methods for improving the ranking of news articles", published on 17 March 2005. The connection was quickly made: TrustRank must be this new method for ranking news stories. We knew that Google has had difficulties with news stories that were not always relevant (or even hoaxes[fr]) going to the top of their ranking, and this phenomenon has only increased since they added countless blogs to their News sources.

The second clue comes when we find (with some quick googling?) an article presenting TrustRank that has been published by "researchers at Stanford University", first as a technical report (March 2004) and then at the VLDB conference (August 2004). The article is available here as a PDF. Conclusions are quickly jumped to: this must be a description of this magical new algorithm, and the link soon spreads from forum to blog.

Unfortunately, there’s something (several things in fact) not quite right about all this. Firstly, the title of the article in question is "Combating Web Spam with TrustRank". It’s a rather interesting article: it attempts to show how we can fight spam by checking a small number of pages manually and then using an algorithm that allows this initial knowledge to be used to separate the wheat from the chaff. Fine, but what this has to do with News is not immediately apparent. Google’s problem with News is not a problem of spam. Google selects its news sources, and if certain blogs, for instance, find themselves indexed, then Google certainly wants them to be there. News sources are not spam, and they don’t have the formal and textual characteristics of spam. No, the problem with news lies elsewhere. It’s not that the sources are undesirable (or they can easily be excluded from the index if they are), but they could be considered more or less reliable (if anyone can really claim to know what that means in terms of news) and we want to give greater weight to more reliable sources in the ranking.
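To make the idea of spreading trust from a few hand-checked pages a little more concrete, here is a minimal sketch. It is not the algorithm as published in the article: it is simply a PageRank-style propagation in which all the "teleportation" weight goes to the manually approved seed pages. The tiny link graph, the page names and the `trust_rank` function are my own inventions for illustration.

```python
def trust_rank(links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from hand-checked seed pages along outgoing links.

    links: dict mapping each page to the list of pages it links to.
    trusted_seeds: set of pages a human has verified as good.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Trust is injected only at the manually verified seed pages.
    seed_share = {
        p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
        for p in pages
    }
    trust = dict(seed_share)
    for _ in range(iterations):
        new_trust = {p: (1.0 - damping) * seed_share[p] for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            # Each page splits its (damped) trust equally among its links.
            share = damping * trust[page] / len(targets)
            for target in targets:
                new_trust[target] += share
        trust = new_trust
    return trust


# Hypothetical graph: three good pages linking to each other, and two
# spam pages that link to each other (and to a good page) but receive
# no links from the good pages.
links = {
    "good_a": ["good_b"],
    "good_b": ["good_a", "good_c"],
    "good_c": ["good_a"],
    "spam_x": ["spam_y", "good_a"],
    "spam_y": ["spam_x"],
}
scores = trust_rank(links, trusted_seeds={"good_a"})
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the spam pages end up with no trust at all, because trust only flows along links that start from trusted pages, and no good page links to them. Real spam is of course not so obliging.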

However, we could imagine how a similar method might be applied, with a few news items being checked manually and an algorithm that allows this “learning” to be extended to all the rest of the news items, and then we would have our link to Google's patent. But there’s just one problem with this idea. On reading the article, we learn that, in addition to the “researchers at Stanford University”, one of its co-authors is none other than Jan Pedersen, Chief Scientist at Yahoo! (and whom I wrote about here the other day) – and I doubt very much that Yahoo! wants to let Google benefit from its technological breakthroughs! A more plausible hypothesis could be that Google has simply stolen the word (which sounds really good) from under the nose of Yahoo! by trying to register it first – there’s nothing new about that! Let’s see what the attorneys make of it all, since the trademark has yet to be attributed to Google and the dossier is still under examination.

As far as this patent is concerned, there’s nothing to indicate that it has anything to do with the TrustRank trademark that Google is so interested in. It might. It might not. We really have no idea. The patent makes no mention of this term. The closeness of the two dates (16 and 17 March) would seem to me to be nothing more than a coincidence, for if we read the patent dossier carefully, we discover that 17 March is the date on which it was published by the USPTO (which is not the same as acceptance – they have a long way to go for that ...). The dossier was presented on 16 September 2003, and I doubt whether Google has much control over the patent office’s calendar ...


I have read this patent in detail. As its name suggests, it describes how to improve the ranking of news articles in a system like Google News. Once we have managed to plough through all the verbiage – for the text is written in the inimitable style of all patents – what lies at the heart of the “invention” actually boils down to relatively little, in my opinion. It consists of a group of 13 "metrics" (even if this term is not used in the mathematically correct sense, since most of them do not satisfy the triangle inequality – but it doesn’t matter, let’s just take the word in its more general meaning of measurement) that allow each news source to be weighted. Here they are:

1. Number of articles produced by the source
2. Average article length
3. Coverage of the source (basically, how many stories does the source produce compared to the overall number of sources)
4. “Breaking score”, i.e. how quickly the source publishes any given news item
5. An indication of how often this source’s news items are used (based on click-through rate)
6. A human opinion of this source (well, well, well!)
7. External audience measurement statistics such as Media Metrix or Nielsen Netratings
8. The size of the staff, which can be determined by the number of different journalists under whose names the news items appear (so no more blogs)
9. The number of different offices or agencies the source has (so, again, no more blogs)
10. The number of original named entities quoted by the source (people, organisations, places) – almost certainly with the idea that secondary sources pass on information but rarely add anything new
11. Breadth, i.e. the number of stories covered by the source
12. International diversity (so the “Gazette du Périgord” is out of luck)
13. Writing style, in terms of a) spelling b) grammar (I wonder how they plan to evaluate that!) and c) reading level, which I imagine concerns standard notions regarding sentence length, rarity of vocabulary, etc.
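
To give an idea of what a "reading level" measurement of the kind mentioned in metric 13 might look like, here is a rough sketch. The patent gives no detail on this point, so everything here is my own assumption: the function name `style_features`, the use of average sentence length, and the use of word length (seven letters or more) as a crude stand-in for rarity of vocabulary.

```python
import re


def style_features(text):
    """Return two crude reading-level indicators for a piece of text."""
    # Split on runs of sentence-ending punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "long_word_ratio": 0.0}
    # Assumption: long words are treated as a proxy for rare vocabulary.
    long_words = [w for w in words if len(w) >= 7]
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "long_word_ratio": len(long_words) / len(words),
    }


sample = "The cat sat. The committee deliberated extensively."
print(style_features(sample))
```

A serious implementation would need a proper sentence splitter and a frequency dictionary for vocabulary, and none of this says anything about how spelling or grammar would be evaluated.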

I’m amazed. To be perfectly honest, I think that I could have got together with a few of my students over a couple of beers and we would have come up with more or less the same ideas (and a few more besides, I’m sure) in a single afternoon. Admittedly, the “invention” does add a few ideas on how to combine these metrics (and experts will see that therein lies the problem!), ideas such as how to work out the average (yes, really!) and other, only slightly more complicated things – but, quite frankly, it’s hardly Nobel Prize-winning stuff.
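
For the curious, combining metrics by averaging can be sketched in a few lines. This is only an illustration under my own assumptions, not the patent's formula: I rescale each metric to the range 0 to 1 (min-max scaling) so that they can be averaged at all, and the source names, metric names and figures below are entirely made up.

```python
def normalise(values):
    """Min-max scale a {source: raw_value} mapping to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All sources tie on this metric; give them a neutral value.
        return {k: 0.5 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}


def source_scores(metrics, weights=None):
    """Weighted average of scaled metrics for each source.

    metrics: {metric_name: {source: raw_value}}
    weights: optional {metric_name: weight}; defaults to equal weights.
    """
    names = list(metrics)
    weights = weights or {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    scaled = {name: normalise(metrics[name]) for name in names}
    sources = next(iter(metrics.values())).keys()
    return {
        s: sum(weights[n] * scaled[n][s] for n in names) / total_weight
        for s in sources
    }


# Invented figures for three hypothetical sources.
metrics = {
    "articles_per_day": {"wire_agency": 900, "regional_paper": 40, "solo_blog": 2},
    "staff_size": {"wire_agency": 2500, "regional_paper": 30, "solo_blog": 1},
    "bureaus": {"wire_agency": 200, "regional_paper": 2, "solo_blog": 1},
}
scores = source_scores(metrics)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With figures like these the large agency comes out on top on every metric and the one-person blog at the bottom, whatever weights are chosen, which is worth keeping in mind when thinking about what such a score actually rewards.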

Beyond this evaluation, which I accept is open to debate, what the agorasphere seems to have missed is the submission date of this patent, September 2003. You don’t have to wait for a patent to be granted in order to implement an idea, and it’s clear that whatever good ideas this patent may contain have long since been incorporated in Google News (along with a whole host of others, no doubt, that are still undergoing experimentation and may lead to other patents in the future). Above all, the role of a patent is the demarcation of territory in order to keep competitors out, and this particular patent seems quite clear in this regard. As we have seen, it is very broad in its scope, listing everything that was on the minds of the Googlers in 2003 (without going into any detail, as in the question of style, for instance) and even what wasn’t: I haven’t even counted all the phrases like “for example” and “in one implementation consistent with the principles of the invention”, which amount to saying that we could do it this way or we could do something else entirely ... But let’s not get into a debate about the role of software patents – there are plenty of other sites on the subject already (example).


So, there's nothing new under the sun then – but still, I must admit to being stupefied by the incredible naivety of this patent. Do the Googlers realise the paradox on which it is built? If we really were to implement the proposed metrics, where would it lead? A handful of sources would always be at the top of the ranking; international sources with lots of agencies around the world, a large staff who write in an impeccable style and whose articles are always of a standard length, sources that cover all subjects and are quick to break news ... and are considered to be excellent by humans and by the appropriate bodies (Nielsen and others). Well, I can tell you what the result would be. Reuters and AP at the top of the list, along with perhaps their Chinese and Indian equivalents in order to avoid offending anyone (and let’s not even mention AFP!).

In this case, why bother indexing thousands of sources (as Google claims to do now)? Surely it would be enough to subscribe to Reuters’ RSS feeds; the result would be more or less the same. The humble Internauts seem to have reached this conclusion as well; when it comes to the adoration of News, 75% of them, at least in the United States, worship the god Yahoo. Yahoo, at least on its US site, has done a good job of polishing up the interface and, when it’s official news we’re after, offers a high quality choice, selected by humans (see metric 6 in the patent!). The French version does the best it can, and is not too bad editorially, despite having more limited resources and a much smaller team (although the redesigned interface has yet to reach France).

As far as more specialised news stories are concerned, stories that are not already all over the Web, well, sites like these are not really where you’re going to find them ... Let me once again mention Rezo [fr], an intelligent alternative news aggregator. And let us pray to the cyber-gods for the development of others, so that we can enjoy a range of different points of view, instead of being forced to see the world according to Google [fr] [en].

In short, it’s probably all much ado about nothing ...
