Jean Véronis
Aix-en-Provence
(France)



Friday, June 26, 2009

Wikio: Over 100,000 UK blogs



I've been quiet recently. I've been working flat out on a project that has required all of my attention: increasing the number of UK blogs indexed by Wikio UK (www.wikio.co.uk). The UK site was the last to launch, after wikio.fr, wikio.it, wikio.es, wikio.de and wikio.com, and the number of sites in its database has always lagged behind somewhat. I therefore put some adapted algorithms in place several weeks ago, and I'm happy to announce that the UK site has now passed 100,000 blogs: 113,000 at the time of writing, to be exact, and this number is set to increase further in the coming hours, as there are nearly 30,000 more blogs in the pipeline.



If you go to the site you will see "Live breaking news from 156920 blogs", but this is the total number of English-language blogs, not just those from the UK. The same number is indeed shown on wikio.com. Both sites draw from the same database but do not display the same results: it's all a question of weighting. The UK site prioritises UK news and the US site prioritises US news (hence the need to geolocate sources). You will see, for example, the differing reactions to international events, be it the situation in Iran or the death of Michael Jackson - all rather interesting.
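To make the idea of weighting concrete, here is a minimal sketch of how one shared pool of items could be ranked differently per edition. It is an illustration only, not Wikio's actual ranking: the function name `rank_for_edition`, the tuple layout and the `boost` factor are all assumptions made for the example.

```python
def rank_for_edition(items, edition, boost=2.0):
    """Order titles for one edition of the site.

    items   -- list of (title, source_country, base_score) tuples (hypothetical layout)
    edition -- country code of the edition, e.g. "UK" or "US"
    boost   -- hypothetical multiplier applied to sources from the edition's country
    """
    def weighted(item):
        _, country, score = item
        # Same database, different weighting: local sources are boosted.
        return score * (boost if country == edition else 1.0)

    return [title for title, _, _ in sorted(items, key=weighted, reverse=True)]


items = [("Story A", "US", 1.0), ("Story B", "UK", 0.8)]
uk_order = rank_for_edition(items, "UK")
us_order = rank_for_edition(items, "US")
```

With these made-up scores, the UK edition puts the UK story first and the US edition puts the US story first, even though both draw on the same two items.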

Alas, geolocating sources is very complicated in practice. It is extremely difficult for our machines to determine whether a site is American or British (or Canadian or Australian, etc.). Obviously, if the URL ends in .co.uk, there is little ambiguity. But this is in fact rarely the case: most British blogs, for example, are hosted on blogspot.com, wordpress.com, etc.
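The domain criterion is the easy one, and a short sketch shows why it settles so few cases. The function name `country_from_domain` is hypothetical, and returning `None` for "no evidence" is a choice made for this example.

```python
from urllib.parse import urlparse


def country_from_domain(url):
    """Return "UK" when the host ends in .co.uk, otherwise None (no evidence)."""
    host = (urlparse(url).hostname or "").lower()
    if host.endswith(".co.uk"):
        return "UK"
    # Hosted platforms such as blogspot.com or wordpress.com say nothing
    # about where the blogger actually is.
    return None


print(country_from_domain("http://example.co.uk/blog"))       # UK
print(country_from_domain("http://someblog.blogspot.com/"))   # None
```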

The algorithms are rather delicate and, as far as I'm aware, no other service goes as far as we do at Wikio in distinguishing UK from US blogs. If you try Google Blog Search or Technorati, you will see that the results are a mish-mash, without any real attempt to sort by country except a (probable) bias towards .co.uk domains.

The difficulty comes from the fact that no single criterion suffices on its own. We can, for example, check the spelling. We know that in Britain they write colour or neighbour, and not color and neighbor as in America. This can be useful, but it does not in fact concern that many words, and we are not guaranteed to find them on your average blog. To complicate matters further, Canadian, Australian and other Commonwealth bloggers use the British spelling style. We can also turn to the blogger's profile: if it cites "London, UK", there you have it. But very often there is no profile on the page, and when there is one it must be found and correctly parsed by the machines. Web 2.0, it appears, lacks certain standards! So in practice this requires a fair bit of work...
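As a rough illustration of the spelling criterion, the sketch below counts British-style and American-style variants in a text. The word lists are tiny and hypothetical (only colour/neighbour come from the discussion above), and the function name `spelling_score` is made up for the example. Note that a positive score only indicates British-style spelling, which Commonwealth blogs share, so it cannot settle the question alone.

```python
import re

# Hypothetical, deliberately tiny word lists; a real system would need many more pairs.
BRITISH = {"colour", "neighbour", "favourite", "centre"}
AMERICAN = {"color", "neighbor", "favorite", "center"}


def spelling_score(text):
    """Return a score in [-1, 1]: positive leans British, negative leans American.

    Returns 0.0 when none of the listed words occur, i.e. no evidence either way.
    """
    words = re.findall(r"[a-z]+", text.lower())
    british = sum(w in BRITISH for w in words)
    american = sum(w in AMERICAN for w in words)
    total = british + american
    if total == 0:
        return 0.0
    return (british - american) / total
```

On an average post the score will often be 0.0 simply because none of the telltale words appear, which is exactly the weakness described above.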

We can also look at the topology of the blogosphere (I hope soon to be able to show you some maps of the US/UK à la Wikiopole FR). UK blogs tend principally to reference UK blogs, and US blogs to reference US blogs. The web is simply a sum of communities... However, in practice it's a little trickier than that: UK blogs also reference US blogs (yet this tends not to happen in the opposite direction, which does help a little).
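One simple way to turn link topology into a signal is to look at where a blog's outbound links point among blogs whose country is already known. The sketch below is an assumption about how this could be done, not a description of the actual algorithm; `uk_link_share`, `links` and `known_country` are hypothetical names.

```python
def uk_link_share(blog, links, known_country):
    """Share of a blog's outbound links that point to known UK blogs.

    links         -- dict mapping a blog to the list of blogs it links to
    known_country -- dict mapping already-classified blogs to "UK" or "US"
    Returns None when the blog links to no classified blog (no evidence).
    """
    targets = [t for t in links.get(blog, []) if t in known_country]
    if not targets:
        return None
    uk_targets = sum(known_country[t] == "UK" for t in targets)
    return uk_targets / len(targets)


links = {"mystery-blog": ["london-diary", "nyc-notes", "unclassified-blog"]}
known_country = {"london-diary": "UK", "nyc-notes": "US"}
share = uk_link_share("mystery-blog", links, known_country)  # 0.5
```

Because UK blogs also link to US blogs but rarely the reverse, even a moderate share of UK targets would be more suggestive of a UK blog than the raw number implies; the sketch leaves that interpretation to the caller.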

So, in order to end up with a reliable sourcing technique, one must combine all these criteria, and let me assure you it has not been simple. But I am rather pleased with the results, both in terms of coverage and reliability. The UK site is now the second biggest in terms of the number of blogs. I hope it will be useful to you if you are interested in British culture and wish to discover blogs from across the Channel. I would have loved that when I was learning English at school (we had only the BBC on short-wave radio...). The themed rankings are still somewhat light, but I am currently working furiously on this with a team of Masters students to whom Wikio kindly granted internships, and we are already seeing some great categories emerging. I don't know whether some (perhaps Wine & Beer) will see the light of day in time for the next ranking, but if not, it will be at the end of July.
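Combining the criteria can be pictured as a weighted vote in which missing signals are simply skipped. This is only a sketch under stated assumptions: the signal names, the weights in `WEIGHTS` and the convention that each signal lies in [-1, 1] (positive for UK, negative for US) are all invented for illustration and say nothing about the real system.

```python
from typing import Dict, Optional

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"domain": 3.0, "profile": 2.0, "links": 1.5, "spelling": 1.0}


def combine(signals: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted average of the available signals.

    Each signal is in [-1, 1] (positive leans UK, negative leans US),
    or None when that criterion found no evidence.
    Returns None when no criterion produced any evidence.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for name, value in signals.items():
        if value is None:
            continue  # e.g. no profile on the page, no telltale spellings
        weighted_sum += WEIGHTS[name] * value
        weight_total += WEIGHTS[name]
    if weight_total == 0:
        return None
    return weighted_sum / weight_total


score = combine({"domain": None, "profile": 1.0, "links": -1.0, "spelling": 1.0})
```

Here a blog on a generic host, with a UK profile, British-style spelling and mostly US link targets, still comes out leaning UK, because no one criterion decides by itself.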

That is also a real challenge: categorising hundreds of thousands of blogs as reliably as possible. It's not simple: a nice example of intermingled semantics and topology. That, however, will be the subject of another post. I don't wish to wear you all out!
