Jean Véronis

Monday, February 28, 2005

Web: MSN cheating too?

Some time ago, I showed that Google inflates its result counts by about 66%, which explains a number of weird inconsistencies, in particular the fact that pages seem to disappear as if by magic. When you search for words such as alive, economist, focusing, etc. in English pages only, you get only about 60% of the results Google claims for the entire Web, which is of course impossible unless these words are also massively used in non-English pages. Yahoo behaves in a much more reasonable way, and tells us that 92% of the pages containing these words are English pages. Google seems to artificially inflate the result counts to make them match the size of its main index combined with its supplemental index, although of course this supplemental index contains very little (URLs, titles, etc.) -- and certainly not the English words that are desperately missing.

We've seen Google and Yahoo. What about MSN? Well... it turns out that there is something fishy there too.

I used the same English word list as in the previous study (and the results were obtained at the same time, i.e. February 6th). The figure below plots the counts given by MSN for English pages vs. for the entire Web (see complete set of results here):

The slope of the regression line indicates that the English results represent only 65% of the results for the entire Web. This is a little better than Google (56%), but still does not make sense.
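For readers who want to repeat the measurement, here is a minimal sketch of how such a slope can be computed. It assumes an ordinary least-squares fit, which may not be exactly the fit used for the plot, and the counts in it are made up for illustration (they are not the study's data). The function name `regression_slope` is hypothetical.

```python
def regression_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up counts for illustration only (NOT the study's data):
# one entry per probe word.
whole_web = [1_000_000, 2_500_000, 4_000_000, 8_000_000]
english_only = [650_000, 1_625_000, 2_600_000, 5_200_000]

slope = regression_slope(whole_web, english_only)
print(f"English pages / entire Web: {slope:.2f}")
```

Each point is one word: the x value is the count reported for the entire Web and the y value is the count reported for English pages only, so the slope reads directly as the English share.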

For French, the plot is as follows (see complete results here):

According to MSN, then, only 75% of the results for these French words are in French pages (Yahoo indicates 97% for the same list).

Does MSN have a supplemental index like Google? Or are the results simply inflated for marketing purposes? I do not have enough information at this point on MSN's architecture, but maybe some readers can shed some light (if so, please comment!).

If we trust Yahoo as a first approximation, we can infer that the "real" index (i.e. in which the page words are indexed) is only about 0.65 / 0.92 = 71% of MSN's claims, if we use the English list as a probe, or 0.75 / 0.97 = 77% if we use the French list.

In conclusion, MSN's index seems to be only around 75% of the size claimed (what is it by the way? They said 5 billion pages before launching it, and I don't remember seeing more precise or recent figures -- again, if you know, please comment!). Consequently, results seem inflated by about 33% (1/0.75 - 1).
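The arithmetic behind these estimates can be checked with a short sketch. It assumes, as the text does, that Yahoo's language ratio is a fair reference, and it takes inflation to be the claimed size divided by the real size, minus one. The function names are hypothetical.

```python
def real_index_share(engine_ratio, reference_ratio):
    """Share of the claimed index that appears to be really indexed,
    using the reference engine's language ratio as a yardstick."""
    return engine_ratio / reference_ratio


def inflation(share):
    """Relative inflation of the claimed size over the real size."""
    return 1 / share - 1


english = real_index_share(0.65, 0.92)  # MSN vs. Yahoo, English list
french = real_index_share(0.75, 0.97)   # MSN vs. Yahoo, French list

print(f"English probe: {english:.0%}")               # 71%
print(f"French probe:  {french:.0%}")                # 77%
print(f"Inflation at 75%: {inflation(0.75):.0%}")    # 33%
```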

Google: 66% inflation; MSN: 33% inflation. About half. Coincidence?

In any case, so far only Yahoo's results seem coherent (should I say sincere?). The irony is that Google probably inflated its count because of pressure from MSN, when MSN announced 5 billion pages, but it seems that MSN is playing a trick too! Search engines playing liar's poker?

2 Comments:

Anonymous said...

I recently did an interview with MSN's search team and they said that the index is currently "north of 5 billion documents".

Great information - thank you for sharing it with us!

May 19, 2005 21:07
Jean Véronis said...

Many thanks for the info, Randfish. Here is a link to this interesting interview if others are interested.

May 19, 2005 21:13
