Sunday, August 14, 2005

Yahoo: Missing pages? (1)

Following the latest announcement by Yahoo about the size of its index (nearly 20 billion pages), various web surfers have noticed that the numbers don't quite add up ... In a comment on my previous post, Béatrice Foenix-Riou points out that, for instance, if we search for the French term azoïque ("azoic"), Yahoo promises us 2380 results, but this figure decreases as we make our way through the pages of results and, in the end, we only get 329 ...

Yahoo tells us: "In order to show you the most relevant results, we have omitted some entries very similar to the ones already displayed. If you like, you can repeat the search with the omitted results included." Google does pretty much the same thing. Since many sites contain identical or virtually identical versions of the same document, this seems like a good idea ... The problem, as Béatrice notes, is that even when you repeat the search, you still only get 576 results. She asks - quite rightly - what has happened to the missing pages (of which there are a mere 1804!) ...

For the same search, Google returns 360 de-duped results (i.e. if we omit "similar documents"), and 623 after repeating the search to include pages with similar content. In both cases that is more than Yahoo, even though Google itself admits that its index is less than half the size of Yahoo's. Charlene Li of Forrester Research draws my attention to the same problem (and develops the idea here), and Aki provides us with a detailed analysis on his blog. The conclusion reached by certain commentators is that Yahoo is "tricking" us too ...

Now, I feel no particular goodwill towards Yahoo, and you must surely have noticed the question mark in the title of my post announcing the increase in the size of its index ;-) I've been wondering about this since March, when Yahoo doubled its figures from one day to the next in an inexplicably perfect manner [here] ... I would be the first to denounce such flagrant trickery if I had solid evidence. But I don't believe that, based on these observations, we can claim that Yahoo is lying to us about the size of its index.

Firstly, let's be clear. The term "index size" is a little ambiguous. When Yahoo announces proudly that it is indexing nearly 20 billion pages or documents, we don't know how many words are being indexed. Yahoo could, paradoxically, be indexing fewer words than a search engine that claims to index 8 billion pages. Yet words are what we type into a search engine and are the sole link between a search engine and the pages themselves ... One of the fundamental reasons for this difference lies in just how big a "slice" of a document is really indexed by the search engine. The Web contains some pretty big documents, and search engines limit their indexing to just a part of these documents, the size of which may vary. Google had a famous limit of 101K, which was abolished in January 2005 [see here] - but no-one really knows what the new limit might be.

This is particularly noticeable when it comes to pdf files (theses, reports, etc.) that may be several hundred pages long. Yahoo seems to be indexing a much smaller part of these documents than Google. Take the following example. The search term "azoïque" suggested by Béatrice, in Google, returns a particularly relevant pdf document, a thesis on organic chemistry from the École Polytechnique. This document is not returned by Yahoo for the same search request. Yet the document is in Yahoo's database, as can be seen if we search for its title: "Principes de chimie radicalaire" ("Principles of radical chemistry").

The problem is that the word "azoïque" appears for the first time on page 16, only 15,200 characters into the document, yet Yahoo hasn't indexed it. Google, on the other hand, hangs in there until around page 68 (it doesn't find "glycinate" on page 69 but finds "chlorosuccinimide" on page 68, which is 86,600 characters from the start ...) This can be seen quite clearly in the cached HTML version.
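To make the effect of such a cut-off concrete, here is a minimal sketch of an index that only reads the first part of each document. Everything in it is an assumption for illustration: the `build_index` helper, the synthetic text, and the two cap values (10,000 and 100,000 characters) are hypothetical and are not the real limits used by Yahoo or Google, which are not known.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[str, str], cap: int) -> Dict[str, Set[str]]:
    """Build a word -> documents index, reading only the first `cap` characters."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        # Anything beyond the cap is never seen by the indexer.
        for word in re.findall(r"\w+", text[:cap].lower()):
            index[word].add(doc_id)
    return index


# Synthetic stand-in for the thesis: the term first appears about
# 15,200 characters in (7 characters x 2172 repetitions = 15,204).
filler = "chimie " * 2172
thesis = filler + "azoïque " + "chimie " * 10000
docs = {"thesis.pdf": thesis}

small_cap = build_index(docs, cap=10_000)    # hypothetical small cap
large_cap = build_index(docs, cap=100_000)   # hypothetical larger cap

print("azoïque" in small_cap)   # False: the term lies beyond the cap
print("azoïque" in large_cap)   # True: the term lies within the cap
print("chimie" in small_cap)    # True: the document itself is still indexed
```

The document is present in both indexes, so a search on words near its beginning finds it either way; only searches on words that first occur after the cut-off come back empty.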

Undoubtedly, this explains why Yahoo, even if it stores a larger total number of pdf documents than Google, finds fewer of them for a given search such as "azoïque". In this specific case, Yahoo only finds 77 de-duped pdf documents containing this word compared to 124 for Google. The same is obviously also true for .doc, .ppt and other files.

If we exclude pdf files, Yahoo retrieves as many documents as Google, and even a few more:

Total: Google 360, Yahoo 331 (Yahoo as a share of Google: 92%)

Search for "azoïque" - De-duped

Total: Google 623, Yahoo 586 (Yahoo as a share of Google: 94%)

Search for "azoïque" - With duplicates
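The non-pdf comparison can be checked with a little arithmetic on the de-duped figures quoted in this post. This is only a rough sketch: it assumes that the pdf counts (124 for Google, 77 for Yahoo) are subsets of the de-duped totals (360 and 331), which the figures themselves do not guarantee.

```python
# De-duped figures quoted in the post for the search on "azoïque".
totals = {"Google": 360, "Yahoo": 331}   # total de-duped results
pdfs = {"Google": 124, "Yahoo": 77}      # de-duped pdf results

# Assumption: every pdf result is also counted in the de-duped total.
non_pdf = {engine: totals[engine] - pdfs[engine] for engine in totals}

# Yahoo's count expressed as a percentage of Google's.
ratio_all = round(100 * totals["Yahoo"] / totals["Google"])
ratio_non_pdf = round(100 * non_pdf["Yahoo"] / non_pdf["Google"])

print(non_pdf)         # {'Google': 236, 'Yahoo': 254}
print(ratio_all)       # 92
print(ratio_non_pdf)   # 108
```

Under that assumption, Yahoo trails Google overall (92%) but comes out slightly ahead once pdf files are set aside (254 against 236).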

So far, all we can confidently state is that Yahoo doesn't index pdf files as well as Google. We can't conclude that Yahoo is lying about the size of its index in terms of the number of documents. Nor, of course, can we confirm this size ;-)

But "azoïque" is a peculiar search term. Such technical searches tend to return a higher proportion of pdf files than most common searches do. Nonetheless, we still haven't explained why Yahoo changes its estimate of the total number of results so considerably while the results are being displayed. We will look at that in my next post, where I will show why we can't extrapolate observations made on infrequent searches to the index as a whole.


