Tuesday, August 16, 2005

Yahoo: Missing pages? (2)

Since I published the first part of this study, the affair of Yahoo's missing pages has caused quite a stir. Google has announced that its researchers do not believe the figures put forward by its competitor (see here), and a detailed study carried out by the NCSA (University of Illinois at Urbana-Champaign) seems to confirm quite clearly the phenomenon that I described in my previous post: for searches that return fewer than 1000 pages, Google systematically returns more results than Yahoo, which seems to contradict the idea that Yahoo's index is two and a half times the size of Google's [23 Aug -- The NCSA has issued a strong disclaimer and the study has been revised; see original version and details].

Unfortunately, the study carried out by the researchers at the NCSA has several shortcomings. Firstly, as I showed in my previous post, Yahoo's indexing of long documents is nowhere near as deep as Google's. As a result, even if Yahoo is not lying about the size of its index in terms of the number of documents, this could partly explain the smaller number of documents returned for certain search requests. Sometimes the document may well be in the database, but it cannot be found by key words that do not appear near the beginning of the document. This is the case, for instance, for the PDF document "Depression and soul-loss", which is returned by Google when searching for inabilities hydrocephalic, but which is not returned by Yahoo for the same search, despite the fact that it is in Yahoo's database (see here).

However, the NCSA study contains an even more worrying error in its methodology, which completely invalidates its conclusions. The authors chose words at random from the computer dictionary ispell and typed them in pairs into the two search engines. This is an absurd strategy, for the chances of a real document containing two words chosen at random from a very large dictionary are virtually zero. The researchers in question are almost certain to find more artefacts (lists of words and spam) than anything else. If one of the two search engines produces fewer of these, we can but salute its filtering mechanism; in no way can we extrapolate from these figures to comment on its behaviour in general or on the size of its index.
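To make the objection concrete, here is a minimal sketch of the sampling strategy and of why it so rarely hits real documents. The index size and the document frequencies are hypothetical values chosen purely for illustration, not figures from the NCSA study, and the assumption that the two words occur independently is mine.

```python
import random

def random_query_pairs(words, n_pairs, seed=0):
    """Draw n_pairs two-word queries uniformly at random from a word list."""
    rng = random.Random(seed)
    return [tuple(rng.sample(words, 2)) for _ in range(n_pairs)]

def expected_cooccurrences(index_size, docs_with_a, docs_with_b):
    """Expected number of documents containing both words, assuming the
    two words occur independently of each other (an idealisation)."""
    return docs_with_a * docs_with_b / index_size

# Hypothetical numbers: an index of 8 billion pages and two rare words,
# each present in 5,000 pages.
expected = expected_cooccurrences(8_000_000_000, 5_000, 5_000)
print(expected)
```

Under these assumed figures, a genuine page containing both words is expected far less than once per query, so whatever the engines do return is likely to be dominated by pages that contain almost every dictionary word, that is, word lists and spam.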

We can see, for instance, that for the first search carried out by the NCSA researchers - carbolization clambers - the only results returned by Google (and which Yahoo does not find) are pages consisting of simple lists of words, most of which seem to be copies of the ispell dictionary itself.

The following document is a typical example:
It consists of a 1.3 MB file containing 134,175 words that seems to be a copy of ispell. It is not returned by Yahoo for the same search and indeed does not seem to figure in the Yahoo database. The Yahoo database, on the other hand, does contain five other (apparently identical) documents that Google does not contain (found via the search wspears dictionary).
It is interesting to note that these documents are the only ones among the 29 returned by my search that are not indexed in the Yahoo database, which only includes their URL. Either Yahoo recognises, for instance from a signature calculation, that this is the ispell dictionary, or else it has a filter that allows it to detect documents that are merely lists of words (which is not too difficult to imagine). This is perfectly intelligent behaviour, and much to the search engine's credit.
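Neither mechanism is hard to imagine. The following is a rough sketch of both ideas, assuming a hash over the normalised word sequence for the signature and a simple line-based heuristic for the word-list filter. The function names and thresholds are hypothetical; nothing here is known about what Yahoo actually does.

```python
import hashlib

def content_signature(text: str) -> str:
    """Hash of the lower-cased word sequence, so that copies differing
    only in case or whitespace produce the same signature."""
    tokens = text.lower().split()
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()

def looks_like_word_list(text: str, min_lines: int = 50,
                         threshold: float = 0.9) -> bool:
    """Heuristic: most non-empty lines hold a single word, and those
    words are almost all distinct (thresholds are arbitrary)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        return False
    singles = [ln.lower() for ln in lines if len(ln.split()) == 1]
    if not singles:
        return False
    single_ratio = len(singles) / len(lines)
    unique_ratio = len(set(singles)) / len(singles)
    return single_ratio >= threshold and unique_ratio >= threshold
```

A crawler could compare `content_signature` against the signature of a known dictionary file to catch exact copies, and fall back on `looks_like_word_list` for lists it has not seen before.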

Readers can consult the list of search terms provided by the authors, and can see for themselves that, in the vast majority of cases retained (i.e. those with fewer than 1000 results), the results in question are lists and spam. Results that prove to be an exception to this rule, such as cultist email, have been eliminated by the authors because they return more than 1000 results.

By carrying out their research in this way, the NCSA researchers have shown just one thing: that Google has a greater capacity to index lists of words, including the ispell dictionary, and spam. In no way does it prove that the Yahoo index is smaller (in terms of number of documents indexed) than that of Google.

Quite the contrary; if we look at the same sites as those where Yahoo "forgets" the copies of ispell, we can see how it generally indexes a far higher number of relevant documents than its competitor. For example, on the site mentioned above, Yahoo announces 1630 results for the search wspears, and I checked that the first 1000 really do exist. Google only returns 289 (or 249 if we exclude "similar results"). In fact, from about the 200th result onwards, the results returned are simply URLs where the content is not indexed, while the first 1000 in Yahoo are all indexed. Here, we have a factor of 5 to 1 in favour of Yahoo ...
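The factor quoted above is simply the ratio of results whose content is actually indexed. A small sketch of the arithmetic, using the counts given in the paragraph (the variable names are mine, and the Google figure is the approximate "200th result" cut-off):

```python
yahoo_indexed = 1000   # first 1000 Yahoo results, all checked and indexed
google_indexed = 200   # Google content is indexed up to about the 200th result

ratio = yahoo_indexed / google_indexed
print(f"{ratio:.0f} to 1 in favour of Yahoo")
```

Since only the first 1000 of Yahoo's 1630 announced results could be checked, the Yahoo count used here is a lower bound.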

The NCSA study contains another considerable bias, which the authors themselves are aware of, since they quite wisely present their working assumptions right at the beginning of their article:
"The study operates under two working assumptions. The first is that both the Yahoo! and the Google search engine return all the results that match the particular keywords and does not do any filtering beyond removing duplicate results."
The thing is, everything seems to suggest that these conditions are not respected. I will demonstrate, in the third part of this article, how this problem invalidates the NCSA study and others of a similar nature.


18 Aug -- Very interestingly, the authors have just modified their text and deleted the phrase "and does not do any filtering beyond removing duplicate results"... [thanks to Serge Courrier, who alerted me to this modification]


