Wednesday, March 23, 2005

Google: 5 billion "the" have disappeared overnight

For quite a while, searching for "the" in Google with the "any language" option returned exactly 8,000,000,000 results. Today, if you type "the" again, you are likely to find that 5 billion occurrences of "the" are gone:

The on Google (Web)

It is possible, however, that you will still get the old count if you try. Google has been "dancing" a lot over the last two weeks. This is not the usual "Google dance" that we were used to seeing from time to time, which lasted two or three days while Google was updating its databases. This new dance is a real St. Vitus' dance: results go back and forth, appear, disappear, and change almost every day.

What has happened is that the Googlers were pretty embarrassed by my computations in early February (see summary here), which seem to have spread around the planet and made a lot of noise in the Googleplex. Since then, they seem to have been busy trying to fix the situation and make the numbers look credible. However, this time it involves not only updating indexes, but also major changes in extrapolation routines, Googlean logic, etc. Probably difficult -- and error-prone. Hence the numerous trials and errors that we seem to observe these days.

I'll wait until Google is stable again (if it ever is ;-) to perform a detailed analysis, but we can already get a sense of the direction in which Google is going. I pointed out that when you searched for "the" in English pages only, you used to get only around 80 million pages, i.e. 1% of the whole, which did not make sense. Today, I get ca. 2.9 billion, i.e. a ratio of 90% of the whole, which does make sense.

The on Google (English)

That ratio is almost exactly what Yahoo reports (3.87 billion pages in any language and 3.52 billion in English). Interestingly enough, the new results reveal very clearly that Yahoo indexes more pages than Google (see here and here).
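For readers who want to check the percentages, here is a minimal sketch of the arithmetic. The function name `english_share` is mine, and I am assuming that Yahoo's two figures are, in order, the any-language count and the English-only count. Google's new any-language total is not recomputed here, since the counts are still moving.

```python
def english_share(english_pages: float, all_pages: float) -> float:
    """Return the English-only hit count as a percentage of the any-language count."""
    return 100.0 * english_pages / all_pages


# Figures quoted in this post, in pages.
# Google before the update: ~80 million English hits out of 8 billion overall.
google_before = english_share(80e6, 8e9)

# Yahoo (assumption: 3.87 billion = any language, 3.52 billion = English only).
yahoo = english_share(3.52e9, 3.87e9)

print(f"Google, before the update: {google_before:.1f}%")
print(f"Yahoo: {yahoo:.1f}%")
```

The old Google figure comes out at 1%, against roughly 91% for Yahoo, which is why the old English count looked so implausible.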

Read the follow-up

25 mar - Google: A snapshot of the update

