Web: Google's counts faked?
[8 feb - Read follow-up: Google's missing pages: mystery solved?]
A few days ago, I showed that Google's boolean operators are flawed in a major way which makes their result counts totally unusable (unless you are ready to accept that A OR B returns half the number of results of A alone, of course).
However, I've found much more, and much more disturbing. The counts themselves are flawed in a major way, even if you don't use any "advanced" (or not so advanced) search capabilities. Take a look at these two screenshots, and find the error:
The first screenshot shows a query for the on the entire Web (i.e. the part Google claims to be indexing); the second shows the same query restricted to English pages only. There is a small oddity that many people have already noticed: the count for the on the entire Web is rounded to exactly 8 billion, which is a bit suspicious. But that is not my point. The query for the in English pages returns only 88 million pages, i.e. just above 1% of the Web total. I have some trouble accepting this result, which would mean that nearly 99% of occurrences of the string the occur in non-English pages.
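The arithmetic behind that 1% figure is straightforward; here is a quick sanity check using the two counts reported in the screenshots above:

```python
# Counts for the word "the" as reported by Google (see screenshots).
web_total = 8_000_000_000   # "the" on the entire Web
en_total = 88_000_000       # "the" restricted to English pages

share = 100 * en_total / web_total
print(f"English share: {share:.1f}%")  # → English share: 1.1%
```

If Google's counts were trustworthy, this ratio should be close to the actual proportion of English pages in the index, not a hundredth of it.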
But I may be wrong. Let's check what Yahoo! says:
The picture is entirely different here: 91% of occurrences of the are located in English pages, which is much more in line with our intuitions.
I am not ready to accept the standard explanation from Google's people ("you know that our figures are estimates, approximations", etc.). Differences of that magnitude are likely to hide something more important. I therefore tried to assess the exact share of English pages indexed by Google. To do so, I chose 50 "words" that are likely to be language-independent: numbers, file extensions, protocols (http, etc.), computer brand names, etc. The words probably occur in other languages as well, and although there might be some individual variations among the words, I don't expect to see any kind of pattern relating their presence in English pages and their frequency. Or, if there is one, it will have to be explained.
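The check itself is simple once the counts are collected: for each test word, take Google's count restricted to English pages and divide it by the count on the entire Web. The sketch below shows the computation; the words and figures in it are purely illustrative placeholders, not the values from my table:

```python
# Hypothetical counts in millions: (English-only pages, entire Web).
# The real values were collected by hand from Google's result pages.
counts = {
    "http": (55.0, 395.0),
    "2000": (108.0, 432.0),
    "jpg": (30.0, 180.0),
}

for word, (en, web) in counts.items():
    share = 100 * en / web
    print(f"{word}: {share:.1f}% of pages containing it are in English")
```

If the words are really language-independent, these percentages should hover around the true share of English pages in the index, with no systematic relation to the words' global frequency.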
The results are summarised (in millions) in the following table (they were computed on January 25th at Google.com from France, and results may vary a little of course depending on the data centers that are hit):
I plotted the percentage of English pages vs the frequency of words in the entire Web in the diagram below:
This is entirely unexpected: we observe a power law linking percentage and frequency, resulting in a very sharp decline in the proportion of English pages containing a given form as the global frequency of that form increases. I am ready to accept a small bias, but I can't see anything that would explain an effect of that magnitude. Anyway, I don't want to rely on intuition alone, so I checked what Yahoo! says about these same 50 words. Yahoo! and Google recognise about the same set of languages, and may differ a little in their crawling strategies, so the plot can be slightly different, but the overall tendency should be roughly the same.
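A power law of the form share ≈ a · frequency^b shows up as a straight line when both axes are logarithmic, so it can be detected with a simple least-squares fit on the log-transformed data. Here is a minimal sketch of that test, on made-up points (the real ones came from the 50 test words):

```python
import math

# Illustrative data only: global Web frequency of each word,
# and the percentage of English pages containing it.
freq = [1e6, 5e6, 2e7, 1e8, 5e8]
share = [45.0, 44.0, 40.0, 12.0, 3.0]

def linreg(xs, ys):
    """Ordinary least squares: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, my - (sxy / sxx) * mx

# Fit log(share) against log(freq): the slope is the exponent b.
b, log_a = linreg([math.log10(f) for f in freq],
                  [math.log10(s) for s in share])
print(f"power-law exponent b = {b:.2f}")  # negative b: share drops as freq grows
```

With honest counts, the slope should be indistinguishable from zero; a markedly negative exponent is exactly the anomaly visible in the Google plot.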
However, the pattern is totally different at Yahoo! :
There is no correlation at all, and the words appear randomly in the plot, as I expected. The regression line is flat, indicating an average proportion of 61% English pages in their index. There is therefore something weird with Google. I tried to "zoom in" by using a logarithmic scale for the X axis in the Google diagram, to see if we could get a clearer idea of what's going on, and indeed this new diagram sheds some light on the situation:
The plot clearly divides into two parts somewhere between 10⁷ and 10⁸. The left part behaves exactly like Yahoo!: there is no correlation at all between a word's global frequency and the En/Web ratio. The linear regression line is flat (it may look slightly bent because the X axis scale is logarithmic) and its level indicates a share of about 43% English pages. The power-law behaviour occurs only in the right part of the diagram, and once the lower-frequency words are excluded the correlation is extremely strong, with a coefficient of determination R² reaching 96%. Both the sudden change near 0.5 × 10⁸ and the very high R² in the second part are difficult to reconcile with a natural effect. It seems likely that something artificial is going on there.
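For readers who want to reproduce the R² figure on their own counts: it is the coefficient of determination of the log-log fit, computed on the high-frequency words only. A sketch, again on illustrative points rather than my actual data:

```python
import math

# Illustrative high-frequency points: (global frequency, % English pages).
pts = [(6e7, 20.0), (1e8, 12.0), (3e8, 5.0), (8e8, 2.0)]

xs = [math.log10(f) for f, _ in pts]
ys = [math.log10(s) for _, s in pts]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = my - slope * mx

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - my) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot
print(f"R² = {r2:.3f}")
```

An R² near 1 on log-transformed data means the points sit almost exactly on a power-law curve, which is what makes a natural explanation so hard to believe.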
But what exactly? That is of course difficult to determine. The sudden break around 0.5 × 10⁸ is consistent with Mark Liberman's findings in his follow-up to my Googlean logic post. Mark plotted X vs (X OR X) for a number of words (I reproduce his diagram below for convenience). He noticed a change in slope around 10⁵ (dotted line). However, there is another, more drastic bend around 0.5 × 10⁸, as in my data (I have marked it in pink in the diagram). The same reason(s) could very well lie behind both problems.
Some people have said that Google may have crawled 8 billion pages (or even more; see Nathan Weinberg's post on InsideGoogle), but may not have actually indexed the entire set, for practical reasons. The real index that the data centers operate on could be much smaller, in which case an extrapolation would be applied to match the 8 billion figure. Besides, Google may have old code (see Mark Liberman's and Geoff Nunberg's comments), and they may simply have forgotten to update it amid the hesitations that seem to have accompanied the increase of their index (again, see Nathan Weinberg's comments for a more complete story).
I don't know if this is the explanation, or even part of it, but I am sure that readers and commentators of this blog will have plenty of ideas (if you write a follow-up somewhere else, please drop me a note at Jean.Veronis@up.univ-mrs.fr).
In any case, I would not recommend professional uses of Google's counts (such as "Google linguistics"). Yahoo! seems more reliable -- or are they simply cleverer?
28 jan - Danny Sullivan has written a useful follow-up on the SearchEngineWatch blog, with pointers to other oddities in Google's counts:
Search engine counts are never something you should depend on, a topic we've discussed many times before. Still, if you're going to get a count, it's nice if it doesn't seem to change much or simply seem absurd depending on the query you do.
Google's counting has been shaky for ages. But the Web: Google's counts faked? article does a lot of math to find the counts have even more weirdness to them.