Jean Véronis


Saturday, March 01, 2008

Wikio: Intelligent news portal

Quite a while ago now, I promised to tell you about the intelligent news portal Wikio [fr]. Like many of you, no doubt, I first gave the site only an absent-minded glance and stupidly took it for just another aggregator, albeit one with Digg-style vote buttons, but nothing to write home about. Tragic error. Wikio is undoubtedly the service that harbors the most advanced linguistic technology on the Web at the moment (and as you’ve noticed, that is the theme of this blog... it was bound to interest me!).

I’ll no doubt come back to it in other posts, but for now I just want to give you one example. Wikio doesn’t just aggregate news and blog posts indiscriminately. When you go to its main competitor, Google News, the home page offers you today’s headlines grouped into major categories (Sports, International, France, Economy, etc.). That’s basically where the intelligence of the service ends. It’s true that when you enter a keyword, the articles are presented in clusters, but the clustering is of poor quality. Enter “Yahoo”, for example, and you will see that the groups are quite unreadable: many news items are not grouped at all, and the existing groups overlap, with the Microsoft affair spread over several of them (by the time you enter the query, the page will certainly have changed, but you get the idea). Yet when the service came online in 2002, I praised it. Document clustering (and thus news clustering) is an extremely difficult problem, as you can imagine, and the system seemed very promising. Alas, as with many Google products, it hardly evolved after its initial launch, although it officially left beta in 2006. Google concentrated more on the number of sources (4,500 for English, so we’re told) than on their quality, or on that of the algorithms… Increasing the number of sources (easy to do automatically) quite logically degrades the quality of the clustering.

Wikio is not perfect (the service is clearly labeled as a beta version), but the underlying technology is infinitely more promising. Articles (from media or blogs) are not merely grouped into high-level categories (Sports, etc.) but placed in a veritable “knowledge tree”, which currently includes over 30,000 categories (at least on the French site; the other version is more recent and might be a little behind).

If you count, you will see that there aren’t quite 30,000 categories (even on the French site). I asked Wikio about this: it’s normal, since the list changes constantly and only categories that have had recent news appear.

To my knowledge, the categories are not visible anywhere in tree form, but one can guess their organization from the form of the URLs. Take the “deafness” category, for example. When you enter this keyword into the engine, it sends you to a page of news on the topic, with a URL that reflects the category hierarchy.

The Health theme contains numerous sub-themes, including Disability, which in turn contains Deafness. This hierarchy is also clearly shown by the navigation links in the top left-hand corner of the page:

News > Health > Disability > Deafness
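
To make the idea of reading the hierarchy off the URL concrete, here is a minimal sketch in Python. The URL below is purely hypothetical (it is not Wikio's actual address format, which is not reproduced in this post), and the function name `breadcrumb_from_url` is my own; the sketch only assumes that each path segment corresponds to one level of the category tree.

```python
from urllib.parse import urlparse


def breadcrumb_from_url(url: str) -> list[str]:
    """Turn a hierarchical URL path into a list of category labels.

    Assumption: each non-empty path segment is one level of the tree,
    with hyphens standing in for spaces.
    """
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    return [segment.replace("-", " ").capitalize() for segment in segments]


# Hypothetical URL, used only for illustration.
url = "http://example.org/news/health/disability/deafness"
print(" > ".join(breadcrumb_from_url(url)))
# prints: News > Health > Disability > Deafness
```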

The Deafness theme in turn contains other sub-themes: Cochlear implants, Sign language, Lip reading, Cued speech and others. But navigating to these sub-categories is less easy, which is a shame (a cluster of tags does appear on the right of the screen, but these are often complex and don’t present only daughter categories). One could imagine more practical solutions (for example, a small drop-down menu under the word Deafness in the navigation links at the top of the page).

Don’t think that this is merely an alert on the keyword deafness, as is the case with Google. The page offers articles that don’t contain this word but do contain related words: deaf, hearing, hearing loss, etc. And, above all, Wikio doesn’t let itself be too easily hoodwinked by articles (and there are plenty in its database, I’ve just checked) that talk about the deafness of power, politicians turning a deaf ear, and so on.
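
The difference between a plain keyword alert and a topical match can be illustrated with a toy sketch in Python. To be clear, this is not Wikio's method, which I don't know in detail and which is certainly far more sophisticated. The related terms and figurative phrases are the ones mentioned above (plus one variant I added), and the function names are hypothetical.

```python
import re

# Related terms taken from the paragraph above.
RELATED_TERMS = {"deafness", "deaf", "hearing", "hearing loss"}

# Figurative uses mentioned above; "turn a deaf ear" is an added variant.
FIGURATIVE_PHRASES = {"deafness of power", "turning a deaf ear", "turn a deaf ear"}


def keyword_alert(text: str, keyword: str = "deafness") -> bool:
    """Naive alert: fires whenever the keyword appears in the text."""
    return keyword in text.lower()


def topical_match(text: str) -> bool:
    """Toy topical match: ignore figurative phrases, then look for
    any related term as a whole word or phrase."""
    lowered = text.lower()
    for phrase in FIGURATIVE_PHRASES:
        lowered = lowered.replace(phrase, " ")
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", lowered)
        for term in RELATED_TERMS
    )


headlines = [
    "New implant helps children with hearing loss",
    "Critics denounce the deafness of power",
    "The minister keeps turning a deaf ear to the unions",
]
for headline in headlines:
    print(keyword_alert(headline), topical_match(headline), headline)
```

The first headline never mentions "deafness" yet is on topic; the second mentions it but is about politics. A hand-written phrase list like this one obviously doesn't scale, which is precisely why the real problem is so hard.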

Wikio presents a fantastic reservoir of structured information that is, to my knowledge, unrivalled. The beauty of the thing is that everyone can create their own news pages, either by subscribing directly to a category’s RSS feed (the deafness feed, for example), or by combining categories to create their own tabs, each of which can in turn be followed through its own RSS feed!
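
Combining categories into a personal tab amounts, in the simplest reading, to merging several feeds. The sketch below shows that idea in Python on in-memory data. The `Item` type, the `merge_feeds` function and the sample items are all hypothetical, and I am only assuming that a tab is the union of its categories' items, deduplicated and sorted by date; Wikio's actual behavior may well differ.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Item:
    title: str
    link: str
    published: datetime


def merge_feeds(*feeds: list[Item]) -> list[Item]:
    """Union several category feeds, keep one item per link, newest first."""
    by_link: dict[str, Item] = {}
    for feed in feeds:
        for item in feed:
            # Keep the first copy seen of each link.
            by_link.setdefault(item.link, item)
    return sorted(by_link.values(), key=lambda item: item.published, reverse=True)


# Hypothetical items standing in for two category feeds.
deafness = [
    Item("Implant trial", "http://example.org/a", datetime(2008, 2, 28)),
    Item("Sign language classes", "http://example.org/b", datetime(2008, 2, 27)),
]
technology = [
    Item("Implant trial", "http://example.org/a", datetime(2008, 2, 28)),
    Item("New hearing aid chip", "http://example.org/c", datetime(2008, 3, 1)),
]

my_tab = merge_feeds(deafness, technology)
for item in my_tab:
    print(item.published.date(), item.title)
```

The article that belongs to both categories appears only once in the merged tab.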

Absolutely fascinating. The possibilities of such a system are mind-boggling... Of course, there is some tweaking to be done here and there, as you may imagine. This is the very forefront of language technology (and, believe me, extremely difficult). And there are some perverse cases. One of my posts, for example, on Google and internet referencing, ended up in the Cosmetics category because I quoted the expression nail varnish. But, honestly, only HAL’s grandson [fr]… would be able to resolve that one, and in 3001 no doubt.

I'll be brief... I know that we live in the channel-surfing age and that most of you have already switched to other channels. So I’ll come back to this and go into greater detail about what I've been able to understand of the surprising technology behind it all. Meanwhile, I’m eagerly awaiting the new version, for which Wikio is apparently starting to do some “teasing” [fr] ;-) So watch this space!


It's confirmed [fr]: a new version is in the starting blocks.
