Monday, March 07, 2005

Web: Yahoo indexes more pages than Google

Read the follow-ups:

9 Mar - Unbelievable! Yahoo doubles its counts
13 Mar - Google adjusts its counts

In previous articles, I have shown that both Google and MSN seem to inflate their result counts substantially. The likely reason is that the real indexes are much smaller than the figures the companies give for marketing purposes (about 60% of the announced size for Google, 75% for MSN).
Yahoo's numbers seem sincere: they are at least coherent and do not show the inconsistencies that betray Google and MSN. In this new study, I will show that Yahoo indexes more pages than Google for French, and about the same number for English. MSN is behind Google for English, but it also indexes more pages than Google for French. This is somewhat paradoxical since, thanks to Google's marketing strategy, most people believe that it is the largest search engine in the competition. We will see here that it is not, at least when it comes to real indexes rather than a simple database of URLs.

Google is actually playing on words. Its home page says "Searching 8,058,044,651 web pages", which is probably true, but makes no claim that the content of those pages is actually indexed. As I have shown in the study cited above, for 40% of these pages Google knows only the URL and has not indexed the words on the page (Google calls this a "supplemental index"). This means that if you type a given word, you have no chance of reaching the pages that contain it if they are in the supplemental index, unless the word is part of the URL itself.
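To make the orders of magnitude concrete, here is a rough back-of-the-envelope sketch. It simply applies the 40% estimate from the text to the figure shown on Google's home page. The variable names are mine, and the result is only as good as that estimate.

```python
# Figure displayed on Google's home page, as quoted in the text.
claimed_pages = 8_058_044_651

# Share of pages known by URL only. The 40% is the estimate from the text,
# not a number published by Google.
url_only_percent = 40

# Integer arithmetic keeps the result exact (rounded down to a whole page).
url_only_pages = claimed_pages * url_only_percent // 100
fully_indexed_pages = claimed_pages - url_only_pages

print(f"URL only:      {url_only_pages:,}")
print(f"Fully indexed: {fully_indexed_pages:,}")
```

Under that assumption, roughly 3.2 billion pages would be reachable only through their URL, leaving about 4.8 billion whose text is actually searchable.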

The strategy I have used is similar to that of my previous studies. Lists of medium-frequency English and French words were submitted to the various search engines, and the counts returned were compared (see details for English and French).
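For readers who want to try a comparison of this kind, the following is a minimal sketch of the bookkeeping involved, not the exact procedure used for this study. It assumes the counts have already been collected by hand into one dictionary per engine. The function name `count_ratios`, the word list and all the numbers are made-up placeholders, not measurements.

```python
from statistics import median


def count_ratios(counts_a, counts_b):
    """Return, for each word present in both dictionaries, the ratio
    counts_a[word] / counts_b[word]. Words with a zero count in
    counts_b are skipped to avoid dividing by zero."""
    ratios = {}
    for word in counts_a.keys() & counts_b.keys():
        if counts_b[word] > 0:
            ratios[word] = counts_a[word] / counts_b[word]
    return ratios


# Placeholder counts for two hypothetical engines (NOT real data).
engine_a = {"maison": 1300, "jardin": 2600, "fenêtre": 650}
engine_b = {"maison": 1000, "jardin": 2000, "fenêtre": 500}

ratios = count_ratios(engine_a, engine_b)

# The median is less sensitive than the mean to a few aberrant counts.
typical_ratio = median(ratios.values())
print(f"Engine A returns about {typical_ratio:.2f} times as many results as engine B")
```

Medium-frequency words are a sensible choice for this sort of probe, since the counts reported for very frequent words are the ones most likely to be rough estimates.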

The scatterplots below show the Yahoo vs Google comparison for English and French respectively:

Yahoo vs Google (English)

Yahoo vs Google (French)
The regression lines (in pink) show that Yahoo and Google return just about the same number of results for English, but that Yahoo returns about 1.3 times as many results as Google for French.
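The slope of such a regression line is what gives the "times as many" figure. As an illustration only, here is one simple way to estimate it: a least-squares line forced through the origin, so that the slope reads directly as a ratio. This is an assumption on my part for the sake of the example, and the plots above may have been fitted differently. The function name and the counts are hypothetical.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope b of the model y = b * x (no intercept).

    Minimising sum((y - b*x)**2) gives b = sum(x*y) / sum(x*x).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    sum_xx = sum(x * x for x in xs)
    if sum_xx == 0:
        raise ValueError("xs must contain at least one non-zero value")
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return sum_xy / sum_xx


# Placeholder counts for the same three words on two engines (NOT real data).
engine_x_counts = [1000, 2000, 4000]
engine_y_counts = [1300, 2600, 5200]

slope = slope_through_origin(engine_x_counts, engine_y_counts)
print(f"slope = {slope:.2f}")
```

A slope close to 1 means the two engines return about the same counts. A slope above 1 means the engine on the vertical axis returns more.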

The next scatterplots show the MSN vs Google comparison:

MSN vs Google (English)

MSN vs Google (French)
The regression lines show that MSN returns fewer results than Google for English (only 0.8 times as many). However, it returns about 1.13 times as many results as Google for French.

Other factors obviously come into play when search engines are compared (such as the relevance of the result ranking). However, the search engines themselves (mainly Google and MSN) have framed the competition in terms of index size. The ironic part is that the only engine that has not played that game, Yahoo, which has not released any index figures, seems to outperform both Google and MSN in terms of pure size.

Finally, the results above seem to indicate different company strategies. Since it is likely that Google has technical problems with increasing its index in a major way (this has been noted before: see discussion here; see also this recent post), the company seems to have chosen to focus on the English-speaking world, whereas both Yahoo and MSN seem to target a wider audience. It would be interesting to make comparisons for other languages (German, etc.) to see if this is confirmed, but I am convinced that it is.

Google depends much more than other companies on its search engine for revenues (98% for Google as opposed to 45% for Yahoo; some analysts are beginning to see a weakness in this dependence), and its technology for contextual advertising (AdSense) is mainly tuned to English. It mostly comes from the CIRCA technology developed by Applied Semantics, a company bought by Google in April 2003 (see press release). CIRCA makes use of an ontology expanded from the English WordNet. There is no equivalent ontology for other languages at the moment and, knowing the amount of work that such databases require, I doubt that Google has been able to develop ontologies for other languages in such a short time span. The pieces seem to fit together: dependence on search means dependence on ads, and contextual advertising implies focusing on English in the current state of technology.

However, this is a dangerous strategy for Google. The other engines might rapidly gain ground against Google in the non-English-speaking world as soon as users realise that they offer better search than Google in their native languages.

