Jean Véronis
Aix-en-Provence
(France)



Wednesday, January 26, 2005

Web: Google's counts faked?




[8 feb - Read followup : Google's missing pages: mystery solved?]



A few days ago, I showed that Google's Boolean operators are flawed in a major way that makes their result counts totally unusable (unless you are ready to accept that A OR B returns half the number of results of A alone, of course).

However, I've found much more -- and much more disturbing. The counts themselves are flawed in a major way, even if you don't use any "advanced" (or not so advanced) search capabilities. Take a look at these two screenshots, and spot the error:





The first screenshot shows a query for the on the entire Web (i.e. the part Google claims to index); the second shows the same query restricted to English pages only. There is a small oddity that many people have already noticed: the count for the on the entire Web is rounded to exactly 8 billion, which is a bit suspicious. But that is not my point. The query for the in English pages returns only 88 million pages, i.e. just above 1% of the Web total. I have some trouble accepting this result, which would mean that nearly 99% of occurrences of the string the occur in non-English pages.
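The arithmetic behind this objection is easy to reproduce; a minimal sketch in Python, using the counts as reported in the screenshots:

```python
# Counts as reported by Google in the screenshots above:
web_the = 8_000_000_000   # pages containing "the", entire Web
en_the = 88_000_000       # pages containing "the", English pages only

# Share of "the"-pages that Google classifies as English
share = 100 * en_the / web_the
print(f"{share:.1f}% of pages containing 'the' would be English")  # → 1.1%
```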

But I may be wrong. Let's check what Yahoo! says:





The picture is entirely different here, since 91% of occurrences of the are located in English pages, which is much more in line with our intuitions.

I am not ready to accept the standard explanation from Google's people ("you know that our figures are estimates, approximations", etc.). Differences of that magnitude are likely to hide something more important. I therefore tried to assess the exact share of English pages indexed by Google. To do so, I chose 50 "words" that are likely to be language-independent: numbers, file extensions, protocols (http, etc.), computer brand names, etc. These words probably occur in other languages as well, and although there might be some individual variation among them, I don't expect to see any kind of pattern relating their presence in English pages to their frequency. Or, if there is one, it will have to be explained.
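The measure itself is straightforward; here is a minimal sketch (my own illustration, not part of the original experiment) computing the En/Web share for a few of the probe words, with counts in millions taken from the table below:

```python
# For each language-independent probe word, compare its count over the
# whole index (Web) with its count restricted to English pages (En).

def english_share(web_count, en_count):
    """Percentage of the pages containing the word that are English."""
    return 100.0 * en_count / web_count

# A few rows from the table, as (word, Web, En) in millions of pages:
probes = [("html", 1600, 58.9), ("pdf", 417, 53.1), ("ftp", 77, 31.6)]

for word, web, en in probes:
    print(f"{word}: {english_share(web, en):.1f}% English")
```

Running this reproduces the percentages in the table (3.7%, 12.7%, 41.0%).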

The results are summarised (in millions) in the following table (they were computed on January 25th at Google.com from France, and results may vary a little of course depending on the data centers that are hit):

Word        Web      En      %
1           4780     67.0    1.4
www         4410     50.2    1.1
2005        2400     63.9    2.7
0           2180     80.7    3.7
10          2140     66.1    3.1
html        1600     58.9    3.7
http        1350     34.2    2.5
web          988     42.3    4.3
php          883     60.7    6.9
htm          846     53.5    6.3
2000         747     62.9    8.4
100          536     57.2   10.7
pdf          417     53.1   12.7
yahoo        277     28.2   10.2
linux        222     31.7   14.3
jpg          221     32.4   14.7
mp3          213     43.5   20.4
amazon       208     34.6   16.6
url          202     36.2   17.9
microsoft    187     24.9   13.3
1000         157     41.7   26.6
google       150     18.0   12.0
xml          119     24.9   20.9
xp           101     24.7   24.5
ibm         81.6     25.7   31.5
txt         80.0     26.7   33.4
ftp         77.0     31.6   41.0
href        74.1     24.1   32.5
perl        51.4     22.0   42.8
https       49.3     21.5   43.6
gnu         43.3     19.8   45.7
mozilla     34.4     13.9   40.4
mpeg        28.7     12.8   44.6
macintosh   28.1     15.5   55.2
firefox     23.6     10.4   44.1
wma         15.5     5.07   32.7
wav         13.5     7.36   54.5
ppt         13.0     7.34   56.5
altavista   11.8     4.19   35.5
rtf         11.4     6.08   53.3
ldap        6.98     3.56   51.0
csv         5.82     2.89   49.7
sgml        5.23     2.58   49.3
gopher      2.92     1.52   52.1
vba         2.57     1.60   62.3
0x00        2.21     0.42   19.1
ie6         2.05     0.73   35.6
vb6         1.10     0.40   36.0
ffff        1.07     0.40   37.3
0xff        1.07     0.32   29.8

I plotted the percentage of English pages vs the frequency of words in the entire Web in the diagram below:



This is entirely unexpected: we observe a power law linking percentage and frequency, i.e. a very sharp decline in the proportion of English pages containing a given form as the global frequency of that form increases. I am ready to accept a small bias, but I can't see anything that would explain an effect of this magnitude. Anyway, I don't want to rely on intuitions, so I checked what Yahoo! says about these same 50 words. Yahoo! and Google recognise about the same set of languages, and they might differ a little in their crawling strategies, so the plot can be slightly different, but the overall tendency should be roughly the same.
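For readers who want to reproduce the fit: a power law pct ≈ a·freq^b is conventionally fitted by linear regression on log-log values. A minimal Python sketch (my own illustration, using synthetic data rather than the actual counts):

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + b*log(x); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / \
        sum((x - mx) ** 2 for x in lx)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic data following y = 100 * x**-0.5 exactly:
xs = [1, 4, 16, 64]
ys = [100 * x ** -0.5 for x in xs]
a, b = fit_power_law(xs, ys)
print(round(a, 1), round(b, 2))  # → 100.0 -0.5
```

With the real (word frequency, En/Web %) pairs from the table in place of the synthetic data, a negative exponent b would correspond to the declining curve in the diagram.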

However, the pattern is totally different at Yahoo! :



There is no correlation at all, and the words, as I expected, appear randomly in the plot. The regression line is flat, indicating an average proportion of 61% of English pages in their index. There is therefore something weird with Google. I tried to "zoom in" by using a logarithmic scale for the X axis in the Google diagram in order to see if we could have a clearer idea of what's going on, and indeed this new diagram sheds some light on the situation:



The plot clearly divides into two parts somewhere between 10^7 and 10^8. The left part behaves exactly like Yahoo!: there is no correlation at all between a word's global frequency and the En/Web ratio. The linear regression line is flat (it may look slightly bent because the X axis scale is logarithmic) and its level indicates a share of about 43% English pages. The power-law behaviour occurs only in the right part of the diagram, and once the lower-frequency words are excluded the correlation is extremely strong, with a coefficient of determination R² reaching 96%. Both the sudden change near 0.5 × 10^8 and the very high R² in the second part are difficult to reconcile with a natural effect. It seems likely that something artificial is going on there.
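The R² quoted here is the standard coefficient of determination of a linear fit (applied to the log-transformed data). A minimal sketch of the computation, checked against a perfectly linear series:

```python
# Coefficient of determination R² for a simple linear regression,
# computed as the squared Pearson correlation between x and y.
def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# A perfectly linear series gives R² = 1:
print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))  # → 1.0
```

An R² of 0.96 on the high-frequency words means 96% of the variance in the (log) En/Web ratio is explained by (log) frequency alone, which is remarkably tight for Web data.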

But what exactly? This is of course difficult to determine. The sudden break around 0.5 × 10^8 is consistent with Mark Liberman's findings in his follow-up to my Googlean logic post. Mark plotted X vs (X OR X) for a number of words (I reproduce his diagram below for convenience). He noticed a change in slope around 10^5 (dotted line). However, there is another, more drastic bend around 0.5 × 10^8, as in my data (I have marked it in pink in the diagram). The same reason(s) could very well be hidden behind the two problems.



Some people have said that Google may have crawled 8 billion pages (or even more; see Nathan Weinberg's post on InsideGoogle) but has not actually indexed the entire set, for practical reasons. The real index on which the data centers operate could be much smaller, in which case an extrapolation would be done to match the 8 billion figure -- apart from the fact that Google may have old code (see Mark Liberman's and Geoff Nunberg's comments) which they may simply have forgotten to update amid the hesitations that seem to have accompanied the index increase (again, see Nathan Weinberg's comments for a more complete story).

I don't know if this is the explanation, or even part of it, but I am sure that readers and commentators of this blog will have plenty of ideas (if you write a follow-up somewhere else, please drop me a note at Jean.Veronis@up.univ-mrs.fr).

In any case, I would not recommend professional uses of Google's counts (such as "Google linguistics"). Yahoo! seems more reliable -- or are they simply cleverer?



Post-scriptum


28 jan - Danny Sullivan writes a useful follow-up on the SearchEngineWatch blog, with pointers to other oddities in Google's counts:
Search engine counts are never something you should depend on, a topic we've discussed many times before. Still, if you're going to get a count, it's nice if it doesn't seem to change much or simply seem absurd depending on the query you do.

Google's counting has been shaky for ages. But the Web: Google's counts faked? article does a lot of math to find the counts have even more weirdness to them.

More...




15 Comments:

Anonymous wrote...

This comment has been deleted by a blog administrator.

26 January 2005, 14:09
Anonymous wrote...

try this: http://www.google.com/search?hl=en&q=the&btnG=Hledat&lr=lang_cs

It says that ~4M pages written in Czech contain the word "the", which is about 25 times lower than the English count (at 4% of the number of English pages), and that makes it interesting too :)

-- spaze\/exploited\/cz

28 January 2005, 03:50
Blogger Jean Véronis wrote...

That's way too much! Yahoo gives a better estimate:

2,520,000 pages containing "the" in Czech vs. 1,720,000,000 in English, i.e. 0.15%.

The situation is worse for French: the number of pages containing "the" is 25% of the number of English pages if we trust Google (it's only 1% at Yahoo).

Thanks for this observation!

28 January 2005, 08:44
Blogger Hilton Santos wrote...

I agree with your research.

It seems that they index and count subfolders of any given domain... thus coming to 8 billion...


http://hilton-santos.blogspot.com

29 January 2005, 11:43
Anonymous wrote...

This comment has been deleted by a blog administrator.

09 February 2005, 06:08
Anonymous John Daniels wrote...

Hello,

I am glad this public Google-sucks thread is here.

I have had my head so far up Google's A$$ for so many years I feel like a fool. It is now 2005, and for at least a year Google's search results have been crap. It is like I have to search longer and longer to find what I am looking for.

Like one of the other posters said, you should be able to type in a manufacturer's name and have their website come up first.

This is just an example:

I own a small business in the US named Excalibur Gate Openers LLC. We manufacture gate openers to automatically open swing gates. The website name is www.excaliburgateopeners.com and the title of the web page is the same.

You would think you could type in Excalibur gate openers as a search string and the search engine would bring up www.excaliburgateopeners.com first.

This was my wake-up call: MSN, Yahoo and Teoma all bring up www.excaliburgateopeners.com when searching for excalibur gate openers.

GOODBYE FOREVER GOOGLE, YOU SUCK!


Google is my homepage for all my computers... O, I guess the key word should be was :-)

09 April 2005, 22:52
Blogger darth-google wrote...

One of the first Google searches I ever did was an exact-phrase "jill hennessy" search.

HA HA HA HA HA HA HA HA HA !!!

Big mistake. Left a rotten taste in my mouth, ya know? A "Jill Hennessy" search is a great example of a few good sites competing with thousands of keyword traps and other garbage; in other words, competition NOT with like sites.

18 April 2005, 20:50
Anonymous wrote...

I think Darth Google may have been hammered when he wrote that. I'm going to read up on the topic of your post and make him feel very stupid.

A Friend

18 May 2005, 04:23
Anonymous wrote...

Have you checked the Google count index lately? I just did, and a search for "the" gives a count of 3.46 billion for the entire Web and 3.36 billion for English pages. It seems they have cleverly fixed this problem.

11 July 2005, 07:26
Blogger David Burdon wrote...

Highly stimulating.

David Burdon - Simply Clicks

26 July 2005, 22:45
Anonymous wrote...

I found a new one, not big, but no ads either!

http://www.foook.com

They also index your site while you watch! Quite cool.

03 August 2005, 12:49
Anonymous wrote...

-yo

18 August 2005, 21:49
Anonymous Yevgeniy wrote...

Well... It all just smells like people are starting to dislike Google... However, I must admit that Google is not very good at counting.

22 September 2005, 12:07
Anonymous wrote...

I like the fact that this blog is using blogger.com, owned by the much-criticised Google...

22 October 2005, 15:33
Blogger mohnkhan wrote...

Interesting research. Well, I am now doing my own part on it.

Have a look at this article too:
http://www.google-watch.org/dying2.html

signed
mohnkhan


Mohiuddinkhan Inamdar
http://www.mohitech.com

26 January 2007, 01:29

Post a comment