Yahoo: Missing pages? (4)
NCSA has issued a strong disclaimer on the Google/Yahoo study that made so much noise a few days ago [original version]. Yesterday the study page read as follows:
The following study was completed by two of Professor Vernon Burton's students at the University of Illinois. Though one of the students previously worked with Professor Burton at the National Center for Supercomputing Applications (NCSA), the study was done outside the scope of any NCSA core projects. When first published online, staff at the NCSA noted several issues with the study, and some revisions have been made to the document to reflect several of these concerns. Changes are detailed at the bottom of this page.

Today, a new, revised version has been put online. Interestingly enough, Prof. Vernon Burton has disappeared as a co-author, leaving his two students alone on the battlefield. Affiliations with NCSA have been removed as well.
Please note again that this study is not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA.
A verification study is currently in progress that addresses the presence of "wordlists" and "dictionaries" in the search results that many rightly point out could count as a source of bias. The new study filters out any dictionary or wordlist results. Preliminary results (from 7000 test queries) indicates that the results of this verification study confirms the conclusions of this study, but final results are still forthcoming.
In the new study, the authors still draw two words at random from the ispell dictionary, but exclude a third random word from the search (using the exclusion operator, -), in the hope of removing word lists and spam from the results. For example, they will search for switchers trophoblast -agnus. They find that Google still returns more results (although less often than before).
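The query construction described above can be sketched in a few lines. This is only an illustration of the idea, not the authors' actual script: the function name `build_query` and the tiny `sample_words` list (a stand-in for the ispell dictionary) are assumptions made for the example.

```python
import random


def build_query(words, rng=random):
    """Draw three distinct words: two to search for, one to exclude."""
    first, second, excluded = rng.sample(words, 3)
    # The leading "-" is the exclusion operator mentioned above.
    return f"{first} {second} -{excluded}"


# Hypothetical stand-in for the ispell word list.
sample_words = ["switchers", "trophoblast", "agnus", "lantern", "quorum"]

print(build_query(sample_words))
```

Note that excluding one random word only drops pages containing that particular word, so a word list that happens to lack it still matches the query.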
Unfortunately, this new strategy doesn't remove the bias. Word lists and spam are still returned, as can easily be checked with any of the queries used, such as switchers trophoblast -agnus. Here are the results from a Google search this morning: all results but one are word lists and junk.
Yahoo returns no results for the same query, and thus misses the one interesting document returned by Google [this one]. It turns out that this document is a long PDF file, which is in Yahoo's database [see here], but is not returned because Yahoo indexes long documents less deeply (see discussion in my previous post). The fact that such documents are not returned does not mean that Yahoo lies about the number of documents indexed (which is the question under debate). The authors do not take into account the difference in filtering strategies either (see here).
In conclusion, this new study is just as biased as the previous one. It still counts numerous junk documents returned by Google, and doesn't address other important issues.
I find it amazing how quickly such a flawed study was quoted with so much excitement all over the blogosphere, and even made its way into the respectable New York Times. Fortunately, a couple of bloggers were keeping watch.
Labels: Yahoo