Search: Crazy duplicates (1)
You’ve probably noticed that I hardly ever talk about Google’s numbers any more. Over the months we’ve seen how the number of results returned was often fanciful at best, and how the whole idea of stating how many pages were indexed was becoming less and less realistic because of how the web is changing (with spam, duplicate documents, dynamic pages, and so on) [2, 3, 4]. Indeed, Google no longer mentions the size of its index on its home page. Nevertheless, just like its competitors, the search engine continues to display the number of pages indexed for each search, and from this we can draw some interesting conclusions about the different strategies employed by each of the search engines.
To illustrate this, I would like to tell you about a little experiment that I carried out recently, and which I found to be very instructive. You might remember how, not so long ago [fr], I wrote about the invention of the word ségolisme (= a word which derives from the first name of Ségolène Royal, the rising star of French politics). My post on this subject dates from 1 June, and I believe the word first appeared on the web some time during the month of May. Well, on 1 June, Google was already returning 24,000 results (as Vicnent pointed out in a comment [fr]). A bit much for a newly-coined word, it would seem, and I certainly had my doubts as to whether this was a true reflection of how the word was being used on the web.
As you may know, Google displays a certain number of documents on the first page of results, but as we make our way through the subsequent pages (a good tip is to set the number of results per page to 100 in the preferences), we find that we end up with far fewer results than originally announced. So, for ségolisme, Google actually only returned 200 results on 1 June, announcing on the last page:
In order to show you the most relevant results, we have omitted some entries very similar to the 200 already displayed. If you like, you can repeat the search with the omitted results included.

I pointed out this issue during the famous battle between Google and Yahoo last summer. Search engines are using more and more elaborate strategies to get rid of duplicates in their search results, which is certainly much better from the user’s point of view! However, the proportion of supposed duplicates is extremely high in some cases, of which ségolisme is just one example. On 1 June, according to Google, the percentage of “similar pages” was 99.2% for this word (23,800 / 24,000).
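The arithmetic behind that percentage is simple: the announced total minus the results actually displayed, divided by the announced total. A minimal sketch:

```python
def similar_share(announced: int, displayed: int) -> float:
    """Fraction of announced results the engine flagged as 'similar'."""
    return (announced - displayed) / announced

# Figures for "ségolisme" on 1 June: 24,000 announced, 200 actually shown.
print(round(similar_share(24_000, 200) * 100, 1))  # 99.2
```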
Every day throughout the month of June I noted the evolution of these two figures (don’t worry, I have tools which can do this automatically!). The graphs below show the number of results with and without similar pages respectively:
With similar pages
Without similar pages
We can see that the number of documents indexed (provided we believe the figures given by Google, of course) rose to 322,000 on 11 June, and then fell to 52,200 by 1 July, with several fluctuations in between. The number of results without repetition remained under 500 (reaching a peak of 469 on 23 June). The percentage of “similar pages” went up to 99.9% around 10 June, which is quite astonishing.
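For the curious: the automatic tracking mentioned above boils down to fetching a results page each day and pulling the announced count out of it. This is a hypothetical sketch, not Google's actual markup — the exact phrasing of the count varies by engine and locale, so the pattern below is an assumption:

```python
import re

# Assumed phrasing such as "of about 24,000 results"; real pages
# differ by engine, language, and era, so adjust the pattern.
COUNT_RE = re.compile(r"of about ([\d,]+) results")

def announced_count(html: str) -> int:
    """Extract the announced result count from a results page."""
    m = COUNT_RE.search(html)
    if m is None:
        raise ValueError("result count not found in page")
    return int(m.group(1).replace(",", ""))

snippet = "Results 1 - 100 of about 24,000 results for segolisme"
print(announced_count(snippet))  # 24000
```

Run once a day (e.g. from a cron job), this yields exactly the two time series plotted above: one query with duplicate filtering on, one with it off.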
So, what are all these pages that Google considers to be similar? On closer inspection we can see that they are mainly RSS versions of the same page, archived pages (a blog often presents the same post in an individual version and also in a weekly or monthly archive), versions of posts both with and without comments, and links such as “trackback” or “new comments”, which are also commonplace on blogs. In fact, the word ségolisme was used in a post on Agoravox, and for a time, the thousands of posts on this platform all automatically included the word ségolisme, not to mention the 459 comments, which are rendered on a separate dynamic page (“report an abuse”). On 1 July, the Agoravox website alone was responsible for 15,200 of the 52,200 results returned by Google. Yet a search on the site itself reveals that only 392 documents contain this word, most of them comments; only one single post actually contains the word: the original one!
At Yahoo, the situation is similar, although not quite as bad. The two graphs below correspond once again to the results with or without “similar pages”:
With similar pages
Without similar pages
On 11 June, Yahoo reached a peak of 15,900 pages indexed, almost 20 times fewer than Google. Without “similar pages”, the maximum figure of 474 was reached on 17 June. It’s interesting to note that:
- once duplicates are removed, Google and Yahoo’s figures are about the same;
- Yahoo’s curves are much more stable than Google’s.
The number of pages returned without “similar pages” seems to be an interesting source of data for quantitative studies, since it is apparently closer to the number of truly original documents. One slight problem is that the search engines do not provide this figure directly, neither on their results pages nor in their APIs, so more complicated tools are required to collect it, with a longer response time. Furthermore, once the number “without similar pages” exceeds 1,000 (the maximum number of results returned by Google and Yahoo), it is simply not accessible.
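Concretely, collecting the deduplicated figure means paging through the results 100 at a time and counting what is actually delivered, stopping at the first short page. The sketch below assumes a hypothetical `fetch_page` callable (page index → list of result URLs); the actual HTTP scraping details are omitted:

```python
# Count deduplicated results by paging until the engine runs dry.
# `fetch_page` is a stand-in for whatever fetches one page of results.
def count_distinct_results(fetch_page, page_size=100, max_pages=10):
    total = 0
    for page in range(max_pages):  # engines cap output at ~1,000 results
        results = fetch_page(page)
        total += len(results)
        if len(results) < page_size:  # a short page means we hit the end
            break
    return total

# Usage with a fake fetcher simulating 200 deduplicated results:
pages = [["url"] * 100, ["url"] * 100, []]
print(count_distinct_results(lambda i: pages[i]))  # 200
```

The `max_pages` cap reflects the 1,000-result ceiling mentioned above: beyond ten pages of 100, the true figure simply cannot be observed.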
In any case, this example clearly shows to what extent the problem of duplicates, and new developments on the Web (particularly the growth in popularity of blogs and forums, with their archives, trackbacks, comments, and so on), can have a dramatic impact on the figures provided by the search engines, since there can be a factor of nearly 1,000 between the number of original documents and the number of results returned. All those who use search engines (Google in particular) for quantitative studies should bear these phenomena in mind. Conclusions drawn from the raw figures may be completely absurd.