Web: Google searching 9,105,590,456 pages [en]
Read also:
- SEO: Google and its image (Dec. 9th, 2007)
As I announced in November, Google doubled its index to eight billion pages and proudly posted on its home page:
Searching 8,058,044,651 pages
However, at the same time, Google has been changing the counts for individual words. I ran the same queries for 16 words on November 22, 2004 and again on January 22, 2005:
| Word | Nov. 22, 2004 | Jan. 22, 2005 |
|---|---|---|
| Aznar | 1,690,000 | 1,600,000 |
| Bernadette | 1,920,000 | 2,250,000 |
| Blair | 14,100,000 | 15,800,000 |
| Chirac | 3,120,000 | 3,280,000 |
| Claude | 15,600,000 | 17,900,000 |
| Coluche | 161,000 | 193,000 |
| Corona | 6,750,000 | 7,430,000 |
| Jacques | 19,000,000 | 21,400,000 |
| Jospin | 669,000 | 768,000 |
| Poutine | 272,000 | 316,000 |
| Raffarin | 752,000 | 893,000 |
| Saddam | 11,100,000 | 12,400,000 |
| Sarkozy | 838,000 | 695,000 |
| Thatcher | 2,140,000 | 2,770,000 |
| Veronis | 62,600 | 60,100 |
| Zidane | 1,090,000 | 1,280,000 |
There is a near-perfect correlation between the results obtained on these two dates (coefficient of determination > 0.999!).
The slope of the regression line (1.13) gives the progression between November 22 (very shortly after Google published its index size) and January 22. This allows us to estimate the new size of the index (8,058,044,651 × 1.13). I am thus happy to make the announcement: Google's index now exceeds nine billion pages. Google should therefore post:
Searching 9,105,590,456 pages
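For readers who want to reproduce the estimate, here is a minimal sketch in Python (assuming a plain ordinary least-squares fit on the raw counts from the table above; the exact fitting procedure is not specified in the post):

```python
# Reproducing the regression sketch (assumption: ordinary least squares on the
# raw counts from the table above; the post does not state the exact procedure).

nov = [1_690_000, 1_920_000, 14_100_000, 3_120_000, 15_600_000, 161_000,
       6_750_000, 19_000_000, 669_000, 272_000, 752_000, 11_100_000,
       838_000, 2_140_000, 62_600, 1_090_000]          # Nov. 22, 2004 counts
jan = [1_600_000, 2_250_000, 15_800_000, 3_280_000, 17_900_000, 193_000,
       7_430_000, 21_400_000, 768_000, 316_000, 893_000, 12_400_000,
       695_000, 2_770_000, 60_100, 1_280_000]          # Jan. 22, 2005 counts

n = len(nov)
mx, my = sum(nov) / n, sum(jan) / n

# Least-squares slope and coefficient of determination (r^2)
sxy = sum((x - mx) * (y - my) for x, y in zip(nov, jan))
sxx = sum((x - mx) ** 2 for x in nov)
syy = sum((y - my) ** 2 for y in jan)
slope = sxy / sxx
r2 = sxy ** 2 / (sxx * syy)

old_index = 8_058_044_651         # figure posted on Google's home page
print(f"slope = {slope:.3f}, r^2 = {r2:.4f}")
print(f"estimated index size: {old_index * slope:,.0f} pages")
```

Run as-is, this gives a slope of roughly 1.13 and an r² above 0.999, i.e. an estimate a little over nine billion pages, consistent with the figures quoted above.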
Why doesn't Google update its home page? If the intent is to hide the progression from competitors, it is rather pointless, since, as this post shows, it can be estimated in a very simple way.
This small annoyance is less serious than the bug in advanced search that I reported the other day, but all these sloppy details end up casting suspicion on quality control at Google. Of course, for the moment only professionals care about these things. They do not make the slightest difference for queries about yellow pages or Britney Spears (see this post).
5 Comments:
Actually Jean, when Google updated the numbers on its front page in November, eagle-eyed watchers (including myself) noticed that for the briefest time, a Google search for "the" yielded over 10 billion results, before Google smacked it back down to the exact same number it says on its front page (which is, of course, statistically impossible). It stands to reason that Google has anywhere from 10 to 13 billion pages already in its index, but is hiding the number from its competitors.
I just saw your follow-up on this topic on InsideGoogle, which I recommend to readers of this post:
http://google.blognewschannel.com/index.php/archives/2005/01/23/google-at-how-many-billion-9-11/
Fascinating, indeed! Many thanks for the additional info.
In a comment on InsideGoogle's follow-up to this story, Philip Lenssen cites a press release from Google which seems to indicate that they count PDF files etc. as Web pages (which is in their interest, anyway, if they want to impress the world with large figures):
http://google.blognewschannel.com/index.php/archives/2005/01/23/google-at-how-many-billion-9-11/
In any case, that doesn't change my point: there has been a 13% increase, i.e. roughly one billion pages, which is not reflected in the main count.
More than 50% of the URLs shown in Google results are:
1. URLs without titles, descriptions or content. These are URLs that are restricted via robots.txt, or pages that have never been, or will never be, fully indexed because of bugs in their indexing system. Example:
http://www.google.com/search?num=10&hl=en&lr=&safe=off&c2coff=1&q=site%3Ausatoday.com+olympics+saltlake&btnG=Search
Almost 50% of these results are empty. Shame on you, Google!
2. Supplemental Results. Because of Google's limitation on the total number of URLs it can store in an index, Google now has at least two separate indexes. This is so they can say they are bigger. But the Supplemental index is rarely used; it's just there so they can say they have more URLs than Yahoo!.
In reality, Google does not have over 8 billion "indexed" URLs. Yes, they may well have over 8 billion URLs in their index(es), but only a fraction are actually fully indexed pages, or pages you can search for and find.
Google will update its logo for all sorts of special events (New Year, its anniversary, the Olympics and so on), but the only time it updates the "Searching 8,058,044,651 web pages" statement is when it feels threatened, as it did when Yahoo! announced the purchase of Overture, AltaVista, Inktomi, bla, bla, bla.
Google's technologies are great in the classroom but terrible in the real world. PageRank is the easiest algorithm to manipulate. It's also easy to steal another site's PageRank, slamming your competition to the bottom of the search results.
Fascinating. Of course the maths is beyond me, so I'll take your word for it.
I also suspect that "indexed" does not mean the same as "displayed in their results". Apparently Google spiders and indexes large numbers of a site's pages without necessarily displaying them in its results, depending on how particular websites perform. Low bounce rates, and more pages suddenly appear... I can't quite work out the rationale for it yet.