Friday, February 10, 2006

Web: A short study in pornometry (2)

I demonstrated in my previous post how Google has a very particular vision of what it considers to be pornographic pages. Google has clearly gone from not doing enough (see here [fr]) to doing just a little too much … What’s more, the intensity of its filtering seems to differ quite considerably from one language to another. A more careful study shows that Google considers the French version of the European Constitution to be “unsafe”, but not the English version. This observation led me to compare the behaviour of several search engines when dealing with these two languages, French and English.

The search engines I used were the same as those in the comparative study that I carried out with my students in Aix-en-Provence (see [fr] 1, 2, 3, 4, 5), for which I will give you the final results in the coming days (if you can bear the suspense …!). I looked at the three American “giants”, Google, Yahoo! and MSN, and three French search engines, Exalead, Voilà and the highly experimental Where possible, I compared the behaviour of these search engines when dealing with French and English: for each of the two languages I randomly selected 150 words (making sure that none of them accidentally had sexual connotations). For each engine, I then calculated the percentage of pages it suspected of being pornographic (each search was restricted to pages in the language concerned). The averages can be seen in the diagram below.

This diagram tells us a lot. Most striking of all is undoubtedly the clear difference between the two languages. The search engines behave far more consistently when dealing with English (although the highest figure is still almost double the lowest). For French, however, the results run from 2% for Exalead to 10% for Google. Does this mean that certain search engines (in particular Exalead) are less effective at filtering pornographic content than others? Not necessarily. For “normal” searches such as those used in the study carried out with my students [fr], the behaviour of each engine was remarkably similar. The filter is extremely powerful for all the engines studied: in total, out of 4200 results returned, only one or two were outright pornographic (with another handful where it’s open to debate, such as a few risqué exchanges on forums).

The other striking discovery is how a single engine may treat the two languages very differently. MSN, and especially Google, filter far more pages in French than in English. This is particularly apparent with Google, which goes from 3.5% in English to 10.0% in French. Conversely, Exalead goes from 2.0% for French to 5.6% for English. Yet I can see no particular reason why the share of pornographic pages should differ from one language to the other on the same search engine.
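To make the arithmetic behind these percentages concrete, here is a minimal Python sketch. It rests on an assumption, since the exact procedure is not spelled out here: that the filtered share for a word is taken from the result counts an engine reports with its filter off and on, and then averaged over the words. The function names and the sample counts are invented for illustration; only the Google and Exalead percentages come from the figures quoted above.

```python
from statistics import mean


def filtered_share(hits_unfiltered: int, hits_filtered: int) -> float:
    """Percentage of results that disappear when the filter is switched on.

    Assumption: both arguments are the result counts an engine reports
    for the same query, with its filter off and on respectively.
    """
    if hits_unfiltered <= 0:
        raise ValueError("need at least one unfiltered result")
    # Reported counts are estimates, so the filtered count may exceed
    # the unfiltered one; clamp the difference at zero in that case.
    removed = max(hits_unfiltered - hits_filtered, 0)
    return 100.0 * removed / hits_unfiltered


def average_share(counts):
    """Mean filtered share over (unfiltered, filtered) pairs, one per word."""
    return mean(filtered_share(u, f) for u, f in counts)


# Made-up counts for three hypothetical words (NOT data from the study).
sample_counts = [(1000, 900), (2000, 1900), (500, 480)]
sample_average = average_share(sample_counts)

# Percentages quoted in the post for the two engines with explicit figures.
reported = {
    "Google": {"fr": 10.0, "en": 3.5},
    "Exalead": {"fr": 2.0, "en": 5.6},
}


def french_to_english_ratio(engine: str) -> float:
    """How many times more is filtered in French than in English."""
    return reported[engine]["fr"] / reported[engine]["en"]


if __name__ == "__main__":
    print(f"sample average: {sample_average:.2f}%")
    for name in reported:
        print(f"{name}: French/English ratio = {french_to_english_ratio(name):.2f}")
```

A ratio above 1 means the engine filters proportionally more in French, a ratio below 1 means it filters more in English. On the quoted figures the two engines sit on opposite sides of 1, which is the asymmetry described above.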

It seems to me that the explanation for these differences is twofold. Firstly, the search engines undoubtedly go too far: since they are unable to work with the level of delicacy required (it’s difficult, I admit!), they have a tendency to overfilter, perhaps using criteria that go beyond simple lexis (as is clearly the case for the European Constitution with Google). This is a general trend, particularly with Google: under pressure from the web-surfing public, filters were put in place very quickly, and apparently, the only way to make a filter work without a particularly discriminating linguistic technology behind it is to bring out the biggest ladle you can find and skim off a lot more than just the cream. I have mentioned this type of problem before when discussing splogs (here and here).

The other part of the explanation is that the search engines vary considerably in their linguistic competence. I’ve already had cause to mention that Google doesn’t seem to be very good at handling languages other than English (for instance here). The results above would seem to confirm this. Conversely, we can see how Exalead, which is a French search engine, is better with French than with English. Yahoo! is more or less stable from one language to the other.

In any case, the fact that 10% of all French pages disappear from Google when the SafeSearch filter is on smacks of overkill. With such a strategy, we are more or less certain not to be troubled by porn-spam, but how many perfectly legitimate sites and documents are caught in the net as well? Of course, it’s mainly sites with a low PageRank that are affected (which is undoubtedly why no-one has protested), but still …

