Jean Véronis
Aix-en-Provence
(France)



Wednesday, January 19, 2005

Web: Googlean logic [en]


I have said several times on this blog how impressed I am by Google's developers (see here and here). However, I have some trouble with their sense of logic, and I wonder whether their advanced search is really so advanced. We all know the "Boolean" operators provided by Google:
  • Chirac OR Sarkozy returns the pages containing one or the other keyword or both,
  • Chirac AND Sarkozy returns the pages which contain both (the AND is optional),
  • Chirac -Sarkozy returns the pages which contain Chirac but not Sarkozy.
First surprise:


Query                Results
Chirac               3 260 000
Chirac OR Sarkozy    1 570 000

The number of pages which contain Chirac or Sarkozy, or both, should be at least equal to the number of pages containing Chirac, but it is less than half of it!
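
The check behind this first surprise can be written down in a few lines. The following is only a sketch in Python: the variable names are made up for the illustration, and the figures are the ones quoted in the table above.

```python
# Counts reported in the post for Google (January 2005).
chirac = 3_260_000
chirac_or_sarkozy = 1_570_000

# A union of two sets can never be smaller than either of its operands,
# so a sound count for "Chirac OR Sarkozy" must be >= the count for "Chirac".
is_consistent = chirac_or_sarkozy >= chirac

# How far off is it? Ratio of the OR count to the single-keyword count.
ratio = chirac_or_sarkozy / chirac

print(f"consistent: {is_consistent}, OR/single ratio: {ratio:.2f}")
```

With these figures the check fails and the ratio comes out below one half.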

Second surprise:

Query                Results
Chirac               3 260 000
Chirac OR Chirac     1 950 000
Chirac AND Chirac    1 950 000
Chirac Chirac        2 010 000

One should get the same result in all four cases, since each query is logically equivalent to Chirac alone.

Third surprise:

Query                 Results
Chirac AND Sarkozy      154 000
Chirac -Sarkozy       1 950 000
-Chirac Sarkozy         320 000
Total                 2 424 000

However, according to the Venn diagram below, the total of the various results should be the same as Chirac OR Sarkozy, i.e. 1 570 000 (but this is probably already false!).
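
The Venn argument is that the three queries cover three disjoint regions whose sizes must add up to the size of the union. A minimal sketch of that arithmetic in Python, using the Google figures quoted above (the function and key names are hypothetical, chosen for the illustration):

```python
def venn_total(both: int, only_a: int, only_b: int) -> int:
    """Size of A OR B computed from its three disjoint regions."""
    return both + only_a + only_b


# Google counts from the table above: A = Chirac, B = Sarkozy.
google = {
    "both": 154_000,      # Chirac AND Sarkozy
    "only_a": 1_950_000,  # Chirac -Sarkozy
    "only_b": 320_000,    # -Chirac Sarkozy
}
reported_union = 1_570_000  # Chirac OR Sarkozy

total = venn_total(**google)
excess = total - reported_union

print(f"sum of regions: {total}, reported OR: {reported_union}, excess: {excess}")
```

The sum of the regions exceeds the reported count for Chirac OR Sarkozy by 854 000 pages, and it is itself well below the 3 260 000 reported for Chirac alone.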




I don't have the slightest idea of the source of the problem. Of course, I know that the numbers returned by Google are approximations (the engine specifically says 'about x results'), and that they can vary slightly depending on which "data centers" process the request, and from one moment to the next. These reasons can explain small differences, but not differences of a factor of two. I've asked in different forums. No one seems to have the solution (if any of you have it, I'd be very curious to know!)

In any case, it is annoying for use in classrooms (the other day I made a fool of myself in front of my students -- ok, I will survive ;-), but it is much more annoying for professional uses, and especially the emerging "Google linguistics".

My advice: it is better to use Yahoo! Search for this kind of calculation:

Test 1:


Query                Results
Chirac               2 219 000
Chirac OR Sarkozy    2 450 000

Test 2:

Query                Results
Chirac               2 210 000
Chirac OR Chirac     2 220 000
Chirac AND Chirac    2 220 000
Chirac Chirac        2 200 000

Test 3:

Query                 Results
Chirac AND Sarkozy      205 000
Chirac -Sarkozy       1 990 000
-Chirac Sarkozy         256 000
Total                 2 451 000

There are still small fluctuations but those I am ready to accept as the result of the algorithm's approximations.
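
To put a number on "small fluctuations", one can compare relative gaps. The sketch below is only an illustration in Python: `relative_gap` is a hypothetical helper, not something either engine provides, and the figures are the ones quoted in the tables of this post.

```python
def relative_gap(a: int, b: int) -> float:
    """Absolute difference between two counts, relative to the larger one."""
    return abs(a - b) / max(a, b)


# Yahoo, Test 3: sum of the three disjoint regions vs. the reported OR count.
yahoo_parts = 205_000 + 1_990_000 + 256_000
yahoo_union = 2_450_000
venn_gap = relative_gap(yahoo_parts, yahoo_union)

# Yahoo, Test 2: four queries that are all logically equivalent to "Chirac".
same_query = [2_210_000, 2_220_000, 2_220_000, 2_200_000]
spread = relative_gap(min(same_query), max(same_query))

# Google, for contrast: sum of regions (2 424 000) vs. reported OR (1 570 000).
google_gap = relative_gap(2_424_000, 1_570_000)

print(f"Yahoo Venn gap: {venn_gap:.2%}")
print(f"Yahoo same-query spread: {spread:.2%}")
print(f"Google Venn gap: {google_gap:.2%}")
```

On these figures both Yahoo gaps stay under one percent, while the Google gap is above a third.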

Should we start Yahoo! linguistics?


Post-Scriptum


Jan 24 - Mark Liberman just posted a very interesting follow-up on Language Log.

Jan 28 - See new developments: Google's counts faked?



7 Comments:

Blogger Derek wrote...

Hi there,

I'll post what I think sounds logical to me. After seeing your results from Google, I thought, hey, why not type it in again to see what numbers I get in return today.

In Google, Chirac returned 3,250,000 while Chirac AND Chirac returned 2,140,000, an increase since your last search. My hypothesis is that Google is not indexing sites as fast as we hoped, or as fast as they would like or claim, if you will.

Notice the results are also very different. I don't understand French so I can't explain some of the results returned :) Even the news returned in Google on both terms varies slightly. One possible explanation could be the density of the word "Chirac" itself within the documents. Maybe there are not as many web pages that contain "chirac AND chirac" as opposed to just "chirac."

But the most plausible explanation I would imagine is the speed at which Google is indexing these sites...

Yahoo has 2,219,000 for Chirac while Google has over 3 million...so can we assume that the difference in the results returned means Google has indexed more junk websites just to increase their index?

And the search continues...

derek@organic-rankings.com

January 28, 2005 16:18
Blogger randfish wrote...

Very insightful stuff. I'd like to see this get picked up by Google and addressed. It's interesting to note that Yahoo! is having success with the accuracy of their results while Google is not.

MSN clearly is having difficulty with this problem too - their new Beta search is wildly inaccurate with result counts. If this is an industry-wide problem, there must be a good explanation...

January 28, 2005 23:12
Anonymous wrote...

The Boolean operators aren't applied by search engines in a strict way, because of their cost and some other factors.
When you see "result 1 of x.000.000", the second number is NOT the total number of documents matching your criteria, but the number of selected documents to which your criteria have been applied.

The "OR" operator is quite different from the Boolean "OR", because it is very expensive in terms of memory and CPU (the indexes can't be used effectively and the data set to scan quickly becomes enormous).
For this reason the OR operator (in a few search engines it works...) is applied just to a small set of pages.

Just try to search house, then cat and then house or cat :)

I hope it's more clear now

January 29, 2005 09:38
Blogger Jean Véronis wrote...

I understand the difficulty of list union or intersection, the need for estimates, etc. My point is precisely that the number of "selected documents", obtained for whatever internal reasons, is misleading for the user. Yahoo and MS Search do a much better job at estimating these counts. It is therefore feasible. Try house OR cat on these engines, and you'll see that the behavior is fine. In addition, there is no reason at all to give fewer results for X AND X than for X alone. I'm inclined to think that this is just old code in Google that needs some serious rewriting ;-)

January 29, 2005 10:02
Anonymous wrote...

This is an old concept, studied previously.

It is nicknamed "Google Flux" and has to do with the way their massive databases communicate. It causes some pages to seem to "wink out" of existence and back in from time to time, causing webmasters much nervousness.

Search for the term and you can find much more background and examples.

January 30, 2005 18:55
Blogger Jean Véronis wrote...

My understanding is that "Google flux" refers to oscillations over a short time span for individual sites. New sites are included in the index, seem to be dropped a few days later, and then reappear.

My observation here is different. It is systematic, persists over long periods of time, affects huge numbers of results, and does not depend on the query keywords. It looks more like a bug or a very bad estimation algorithm.

January 30, 2005 19:54
Anonymous mF wrote...

Found that funny Flash app to illustrate your post.

March 1, 2005 11:17
