Yahoo: Missing pages? (3) [Technologies du Langage]

The debate about the size of Yahoo’s index is heating up; even the New York Times has got involved. An extremely detailed study [original version] carried out by researchers at NCSA, which I wrote about here, seems to provide grist to the mill for sceptics by suggesting the numerical superiority of Google’s index. Nonetheless, I demonstrated in my two previous posts [here and here] that certain errors in its methodology completely invalidate the conclusions reached by this study. In this third section, I will go on to show that even its basic assumptions are wrong.

As I mentioned at the end of my last post, the NCSA authors assume that search engines perform no filtering (for instance to eliminate spam sites) and return all the results in their index for each and every search. If this were not the case, one could not legitimately extrapolate the results obtained from small frequencies (fewer than 1,000) to the index as a whole, since this filtering would certainly not be proportionate to the number of results, nor would it be identical for each of the search engines under comparison. Yet everything suggests that the search engines do indeed filter their results.

Many web surfers have noticed some strange behaviour from Yahoo. For instance, Béatrice explains in a comment on this post [in French -- English version here] that when we do a search for a term like "azoïque" (which is French for “azoic”, a chemical term), Yahoo initially gives us a total number of results on the first page (2380), then replaces this number with a new, lower figure on each of the following pages, until we end up with a much lower number (576 in this case, if we extend the search to include similar results).

I have tested Yahoo using words within a wide range of frequencies, and this behaviour is systematic. The “loss” rises as the frequency falls.

We could of course put this behaviour down to a bug, or imagine that it might be an attempt to manipulate the data, but it’s so clearly visible that I have trouble believing Yahoo’s developers could be so negligent. Moreover, a similar but less obvious phenomenon also affects Google. The most likely hypothesis is in fact that results are filtered after every search in order to avoid undesirable pages, particularly spam.

Spam is clogging up the web, and search engines are making a major effort to fight it, since it can have an extremely negative impact on the relevance of the results they return. There are two complementary ways of fighting against this plague:

Identify a document or site as spam when it is indexed, and exclude it from the index.
Keep an up-to-date blacklist that allows URLs of known spammers to be excluded after the index has been calculated.

This second technique is especially interesting, since it allows the search engines to react more quickly when new spam is discovered and also allows for dynamic updating without the need to recalculate the index and propagate it across all the search engine’s servers. It is, I believe, this filtering mechanism that we can see at work when pages “disappear” from the total number of results announced.

Needless to say, the search engine doesn’t filter all the results against the blacklist for any given search! If the user asks for 10 results, it is enough to apply the blacklist from the top of the list of results until 10 valid results are obtained. If the index contains n results, and we have had to eliminate m, a simple rule of three gives an estimate of the total number of results after filtering: 10 n / (10 + m). The majority of users never request the second screen of results. But if they do, the same mechanism is reapplied, and we then have an improved estimate. Since we know that we have eliminated a total of m' documents, we can display the new estimate of 20 n / (20 + m'). And so on and so forth: easy as pie. Of course, search engines no doubt use more complicated functions than the rule of three, since there is no reason why the proportion of spam should remain constant from one screen of results to the next: listing results by relevance even suggests that there ought to be less spam towards the top of the list.
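To make the rule of three concrete, here is a minimal Python sketch of the mechanism as I imagine it. It is an illustration only, not the code of any real search engine: the function name `filtered_page`, the toy URLs and the toy blacklist are all assumptions of mine, and only the figure of 2380 comes from the "azoïque" example above.

```python
from typing import Iterable, List, Set, Tuple


def filtered_page(
    ranked_urls: Iterable[str],
    blacklist: Set[str],
    index_count: int,
    wanted: int = 10,
) -> Tuple[List[str], int, float]:
    """Walk the ranked list from the top, skipping blacklisted URLs,
    and stop as soon as `wanted` valid results have been collected.

    Returns (valid results, number eliminated, estimated total).
    Hypothetical helper: real engines surely do something more refined.
    """
    valid: List[str] = []
    eliminated = 0
    for url in ranked_urls:
        if url in blacklist:
            eliminated += 1
            continue
        valid.append(url)
        if len(valid) == wanted:
            break
    examined = len(valid) + eliminated
    # Rule of three: scale the raw index count n by the share of valid
    # results seen so far, i.e. wanted * n / (wanted + m).
    estimate = index_count * len(valid) / examined if examined else 0.0
    return valid, eliminated, estimate


# Toy data (invented): 100 ranked URLs, every fourth one blacklisted.
urls = [f"site{i}.example" for i in range(100)]
spam = {u for i, u in enumerate(urls) if i % 4 == 3}

for wanted in (10, 20):  # first screen, then first two screens
    _, m, total = filtered_page(urls, spam, index_count=2380, wanted=wanted)
    print(f"{wanted} results shown, {m} eliminated, estimated total {total:.0f}")
```

Because the toy spam is spread evenly, the estimate stays the same from one screen to the next. With spam concentrated further down the list, the announced total would shrink as the user pages through the results, which is the behaviour described above.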

It is exactly this mechanism that we can see at work with Yahoo and Google. The fact that the proportion of pages filtered rises as the frequency of the keyword falls is perfectly logical. There are two factors that contribute to this. On the one hand, spam sites make intensive use of dictionaries and lists of random words to produce artificial texts that attempt to fool the search engines. In doing so, these artificial texts use a proportion of uncommon words that is far above the norm. On the other hand, listing results by relevance implies for high-frequency searches that the pages or documents at the top of the list are probably not spam, as I just mentioned.

Most astonishing of all is that the results published by the NCSA researchers themselves very clearly show this filtering mechanism at work! In their table 3, they show that the percentage of real results returned by Yahoo for the whole of their searches is only 27% (i.e. 73% filtering), compared to 92% for Google (8% filtering). I quote [and you can find the whole study here]:

Table Three (n=10,012)

	Estimated Search Results (Excluding Duplicate Results)	Total Search Results (Excluding Duplicate Results)	Percent of Actual Results Based on Estimate	Estimated Search Results (Including Duplicate Results)	Total Search Results (Including Duplicate Results)	Percent of Actual Results Based on Estimate
Yahoo!	690,360	146,330	21.1%	821,043	223,522	27.2%
Google	713,729	390,595	54.7%	708,029	651,398	92.0%
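The 27% and 92% figures I quote can be re-derived from the "Including Duplicate Results" columns of the table. The short Python check below copies the four counts from Table Three; the dictionary layout and the function name `returned_share` are my own choices for illustration.

```python
# Counts copied from the "Including Duplicate Results" columns of Table Three.
table_three = {
    "Yahoo!": {"estimated": 821_043, "returned": 223_522},
    "Google": {"estimated": 708_029, "returned": 651_398},
}


def returned_share(estimated: int, returned: int) -> float:
    """Percentage of the announced results that were actually returned."""
    return 100.0 * returned / estimated


for engine, row in table_three.items():
    share = returned_share(**row)
    # The complement is the share of announced results that never show up.
    print(f"{engine}: {share:.1f}% returned, {100 - share:.1f}% missing")
```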

Yahoo applies much more filtering than Google; undoubtedly their blacklist is more complete. In fact, for the test searches in question, Yahoo returns far less junk than its competitor. While this is very much to the credit of the search engine, it does mean that this mechanism makes it impossible to extrapolate the results observed to the size of the index as a whole. So, here we have a third reason that allows me to state that the NCSA researchers have proved nothing at all, other than that Google does a very good job of indexing spam and ispell!

These fellow researchers of mine must have been very upset to hear my criticisms (they can't have missed the pointer in the New York Times), along with those of certain other bloggers (here). While I was writing this post, Serge Courrier [who was interviewing me for 01net] brought to my attention that they have modified their page and removed their qualifying remark about their methodology with regard to filtering. So they have obviously realised their error, but rather than withdraw their study (anyone can make a mistake, after all), they have chosen instead to remove this carefully phrased remark that did them credit. It’s one way of doing things, I suppose. But not the one I would have chosen ...

Update

22 Aug -- NCSA's staff admits the flaws and issues a strong disclaimer
23 Aug -- A new, revised (but still biased) study is put online

Read details: Yahoo: Missing pages? (4)

Labels: Yahoo


Thursday, August 18, 2005
