Web: Googlean logic [en]
I have said several times on this blog how impressed I am by Google's developers (see here and here). However, I have some trouble with their sense of logic, and I wonder whether their advanced search is really so advanced. We all know the "Boolean" operators provided by Google:
- Chirac OR Sarkozy returns the pages containing one or the other keyword or both,
- Chirac AND Sarkozy returns the pages which contain both (the AND is optional),
- Chirac -Sarkozy returns the pages which contain Chirac but not Sarkozy.
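In set terms, these three operators are simply union, intersection and difference. Here is a minimal sketch in Python, assuming a tiny invented corpus (the `docs` dictionary and the `pages_with` helper are hypothetical, purely for illustration, and say nothing about how Google actually evaluates a query):

```python
# Toy corpus: document id -> set of words it contains (invented for illustration).
docs = {
    1: {"chirac", "president"},
    2: {"sarkozy", "minister"},
    3: {"chirac", "sarkozy"},
    4: {"paris"},
}

def pages_with(word):
    """Return the ids of the documents containing the given word."""
    return {doc_id for doc_id, words in docs.items() if word in words}

chirac = pages_with("chirac")
sarkozy = pages_with("sarkozy")

either = chirac | sarkozy        # Chirac OR Sarkozy: one, the other, or both
both = chirac & sarkozy          # Chirac AND Sarkozy: both keywords
only_chirac = chirac - sarkozy   # Chirac -Sarkozy: Chirac but not Sarkozy

print(sorted(either), sorted(both), sorted(only_chirac))
```

With real sets, the union can never be smaller than either of its members, which is what makes the counts that follow so puzzling.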
|Chirac||3 260 000|
|Chirac OR Sarkozy||1 570 000|
|Chirac||3 260 000|
|Chirac OR Chirac||1 950 000|
|Chirac AND Chirac||1 950 000|
|Chirac Chirac||2 010 000|
One should get the same result in all four cases.
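The reason is that a set combined with itself is still the same set (A OR A = A AND A = A), so the four queries describe exactly the same pages. A quick way to quantify the disagreement is to compare the largest and smallest counts; the sketch below uses only the figures from the table above, and the `spread` helper is my own name for the ratio:

```python
# Google counts reported above for four spellings of the same query.
google_counts = {
    "Chirac": 3_260_000,
    "Chirac OR Chirac": 1_950_000,
    "Chirac AND Chirac": 1_950_000,
    "Chirac Chirac": 2_010_000,
}

def spread(counts):
    """Ratio between the largest and smallest count (1.0 means identical)."""
    values = list(counts.values())
    return max(values) / min(values)

google_spread = spread(google_counts)
print(f"largest / smallest = {google_spread:.2f}")
```

A ratio close to 1 would be compatible with rounding; here the largest count exceeds the smallest by roughly two thirds.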
|Chirac AND Sarkozy||154 000|
|Chirac -Sarkozy||1 950 000|
|-Chirac Sarkozy||320 000|
|Total||2 424 000|
However, according to the Venn diagram for these two keywords, the three results above are disjoint and together make up Chirac OR Sarkozy, so their total should equal that count, i.e. 1 570 000 (but this figure is probably already false!).
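The identity behind this is |A OR B| = |A AND B| + |A -B| + |-A B|, and similarly |A| = |A AND B| + |A -B|. The arithmetic can be sketched as follows, using only the Google counts quoted above (the variable names are mine):

```python
# Google counts quoted in the tables above.
chirac_and_sarkozy = 154_000
chirac_not_sarkozy = 1_950_000
sarkozy_not_chirac = 320_000
chirac_or_sarkozy = 1_570_000
chirac = 3_260_000

# The three regions of the Venn diagram are disjoint,
# so their sizes should add up to the size of the union.
union_from_parts = chirac_and_sarkozy + chirac_not_sarkozy + sarkozy_not_chirac

# Likewise, "Chirac" alone is the intersection plus the Chirac-only region.
chirac_from_parts = chirac_and_sarkozy + chirac_not_sarkozy

print("OR from parts:", union_from_parts, "reported:", chirac_or_sarkozy)
print("Chirac from parts:", chirac_from_parts, "reported:", chirac)
```

Neither identity holds: the parts add up to far more than the reported OR count, and to far less than the reported count for Chirac alone.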
I don't have the slightest idea of the source of the problem. Of course, I know that the numbers returned by Google are approximations (the engine specifically says 'about x results'), and that they can vary slightly depending on which "data center" processes the request, and from one moment to the next. These reasons can explain small differences, but not differences of a factor of two. I have asked on several forums, and no one seems to have the answer (if any of you do, I would be very curious to hear it!).
In any case, it is annoying for classroom use (the other day I made a fool of myself in front of my students -- ok, I will survive ;-), but it is much more annoying for professional use, especially for the emerging "Google linguistics".
My advice: it is better to use Yahoo! Search for this kind of calculation:
|Chirac||2 219 000|
|Chirac OR Sarkozy||2 450 000|
|Chirac||2 210 000|
|Chirac OR Chirac||2 220 000|
|Chirac AND Chirac||2 220 000|
|Chirac Chirac||2 200 000|
|Chirac AND Sarkozy||205 000|
|Chirac -Sarkozy||1 990 000|
|-Chirac Sarkozy||256 000|
|Total||2 451 000|
There are still small fluctuations, but I am ready to accept those as the result of the algorithm's approximations.
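To put a number on "small", one can measure the gap between the reported OR count and the sum of its three disjoint parts, relative to the reported count. The sketch below uses the Yahoo! figures above; the 5% `TOLERANCE` is an arbitrary threshold I picked for illustration, not a value given by either search engine:

```python
# Yahoo! counts quoted in the tables above.
yahoo_or = 2_450_000
yahoo_parts = [205_000, 1_990_000, 256_000]  # AND, Chirac -Sarkozy, -Chirac Sarkozy

TOLERANCE = 0.05  # assumed threshold for "acceptable approximation"

def relative_gap(expected, observed):
    """Absolute difference as a fraction of the expected value."""
    return abs(observed - expected) / expected

gap = relative_gap(yahoo_or, sum(yahoo_parts))
print(f"gap = {gap:.4%}, acceptable = {gap <= TOLERANCE}")
```

The Yahoo! parts differ from the reported OR count by 1 000 pages out of 2 450 000, well under a tenth of a percent, which is the kind of deviation one expects from rounded estimates.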
Should we start Yahoo! linguistics?
24 Jan - Mark Liberman just posted a very interesting follow-up on Language Log.
28 Jan - See new developments: Google's counts faked?
Labels: Google