Google: Pages à gogo [Technologies du Langage]

Imagine my surprise when I discovered that Google now indexes 584,000 pages of my professional website! I know I write a lot (too much, some may say) but still, several hundred thousand pages in the space of a few days is beyond even my capabilities ...

When I took a closer look, I soon realised that this sudden massive increase was due to my concordance program for the European Constitution (and the French Constitution) [see English version]. Long-time readers of this blog (yes, I already have some “long-time” readers, at least in blog years) will remember that back in April I wrote a little program for navigating through the infamous “Treaty establishing a constitution for Europe” – our beloved institutions hadn’t thought to provide us with anything other than an all but unreadable, 480-page tome in pdf format ...

You can perform a search by typing a word in the search box, and if you click a letter, you can also see a list of the words that appear in the draft European Constitution and the French Constitution.

You just need to click a word (banque, for instance) to see every passage containing this word ... If you click one of these passages, the relevant page of the draft European Constitution then appears (for example, Article III-159) [same queries in English: bank, Art. III-159].
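For readers curious about what such a concordance does behind the scenes, here is a minimal sketch in Python. It is not the actual program running on my site: the article identifiers, the sample sentences, and the function names `build_index` and `concordance` are all invented for illustration. The idea is simply to record where each word occurs, then return a window of context around each occurrence.

```python
import re
from collections import defaultdict

def build_index(articles):
    """Map each lowercased word to the (article_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for article_id, text in articles.items():
        for match in re.finditer(r"\w+", text):
            index[match.group().lower()].append((article_id, match.start()))
    return index

def concordance(articles, index, word, width=30):
    """Return (article_id, snippet) pairs showing `word` with `width` characters of context."""
    lines = []
    for article_id, start in index.get(word.lower(), []):
        text = articles[article_id]
        left = max(0, start - width)
        right = min(len(text), start + len(word) + width)
        lines.append((article_id, text[left:right]))
    return lines

# Invented sample text, NOT the real wording of the treaty.
articles = {
    "A-1": "The central bank shall report to the council every year.",
    "A-2": "Each member state keeps its own bank holiday calendar.",
}
index = build_index(articles)
hits = concordance(articles, index, "bank")
```

Each entry in `hits` pairs an article identifier with the passage around the word, which is roughly what you see after clicking a word in the list.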

All of these are virtual pages, generated by a program as and when the queries are made. Nonetheless, Google has diligently followed each of the links, and indexed every single one of them. That’s quite a few pages, believe me! Hundreds of thousands of virtual pages, each containing a range of different fragments from the draft European Constitution and the French Constitution. Yahoo!, on the other hand, is far more conservative than its competitor and does not follow the links. As a result, only 21,900 pages of my site are indexed by Yahoo!, which seems to more or less correspond to my dabblings in HTML over what has now been more than ten years on the web …
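To see why a handful of texts can turn into so many indexed pages, here is a toy simulation, again in Python. Everything in it is an assumption made for the sake of illustration: the URL scheme (`/letter/`, `/word/`, `/article/`), the `hl` parameter, the three-word vocabulary, and the `render` and `crawl` functions bear no relation to my real site or to how Google's robot actually works. The sketch just follows every link breadth-first and counts the distinct URLs it meets.

```python
from collections import deque

# Invented vocabulary: each word maps to the (invented) articles containing it.
WORDS = {"bank": ["A-1", "A-2"], "council": ["A-1"], "state": ["A-2"]}

def render(url):
    """Generate, on demand, the outgoing links of a virtual page."""
    if url == "/":
        # Home page: one link per initial letter.
        return [f"/letter/{c}" for c in sorted({w[0] for w in WORDS})]
    kind, _, key = url.strip("/").partition("/")
    if kind == "letter":
        # Letter page: one link per word starting with that letter.
        return [f"/word/{w}" for w in sorted(WORDS) if w.startswith(key)]
    if kind == "word":
        # Word page: one link per article, with the word passed along (hypothetical `hl` parameter).
        return [f"/article/{a}?hl={key}" for a in WORDS[key]]
    return []  # Article pages are treated as leaves in this sketch.

def crawl(start="/"):
    """Follow every link breadth-first and return the set of URLs seen."""
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for link in render(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

pages = crawl()
article_pages = [p for p in pages if p.startswith("/article/")]
```

With only two underlying articles, the crawler already finds four distinct article URLs, because the same article is reachable once per word that leads to it. Scale the vocabulary up to every word in two constitutions and the page count explodes in the same way.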

I have no way of knowing the overall impact of Google’s new indexation method, but in all likelihood mine is not the only site where this has happened. Could this sudden inclusion of dynamic pages go part of the way towards explaining the enormous leap in Google's index size at the beginning of September, when it increased nearly threefold, leaving aside what it says on its home page (see here)? Thanks to Trendmapper, we can see the dramatic increase for the search query "véronis", for instance (Google is in yellow). Indeed, Trendmapper shows that the same thing has happened for nearly all search queries.

Needless to say, this has a negative impact on quality. By massively and blindly indexing automatically-generated pages in this way, Google is certainly adding to the “noise” in its index (spam, lists of words, etc.), which was already worse than its competitor’s last August, even before this quantum leap (see here). Google’s engineers are smart enough to realise this, and I can’t help thinking that this sudden opening of the floodgates to allow in dynamic pages is nothing more than a panic move in the (absurd) war over index sizes, coming just after Yahoo! announced that its index had reached 19.2 billion pages. I’ve been on the lookout for a shock announcement from Google, but there has been nothing so far: the home page is still stuck at 8 billion. Make of that what you will.

In any case, all of this provides real food for thought. Dynamically-generated pages are becoming more and more common on the web: more and more sites are now managed using CMSs (content management systems), such as SPIP, which generate pages on the fly. One of the best-known examples is Wikipedia, but this is very much a general trend. But how can you tell good dynamic links from bad, and in particular from spam? I didn’t mean any harm with my concordance program, but if I were an unscrupulous SEO, I could just as easily build what is known as a spider trap, which generates random text on the fly in order to trick the robots (or spiders) that carry out the indexing. A fair number of these exist already (although you must forgive me if I don’t give them publicity by adding links).

Of course, statistical techniques allow the worst offenders to be filtered out, as I said when discussing splogs. But I ended that particular post by mentioning how it was becoming more and more difficult to tell spam apart from genuine text, as spammers are learning fast and now avoid making the most blatant statistical errors. In a way, without meaning to, I have built the perfect spider trap: who could claim that the extracts from the draft European Constitution and the French Constitution fail to respect the statistical criteria for “good” texts? All I’d need to do is add some links to a commercial site, or even just live off my earnings from the Google ads that I could put on my virtual pages. Others have had the same idea, and it is my belief that the fight against web spam will become one of the major challenges of the next few years. If the search engines can’t come up with the right tools, spam may well end up killing off the web as we know it, just as it nearly did with email.

Labels: Google

Friday, September 23, 2005
