Splogs (a newly coined word, from spam + blog) are to blogs what spam is to email: annoying little things designed to sell you Viagra or a whole host of other, equally suspect, services. How they work is quite simple: you open a free blog (or hundreds of them), stuff it full of dummy text and, most importantly, links to the real site where you plan to sell us something (or, more probably, rip us off). Then all you have to do is wait until Google comes calling and, since Google is very good at indexing spam ;-), the customers will soon start pouring in...
Here’s a typical example (the title of each post is a link to a .biz site):
Blogger is obviously a major source of splogging: free, easy to set up and fill using automated procedures, and well indexed by Google (see here [fr]). When you learn that Blogger is a subsidiary of Google, you may well wonder if Google isn’t giving Blogger a little helping hand here (just compare the ranking of blogspot.com sites on Google with their ranking on Yahoo or MSN). But the great paradox is that, in doing this, Google is polluting itself by generously indexing the splogs generated on Blogger...
I’ve just read (a little late, I admit) an extremely interesting post by Philipp Lenssen (Google Blogoscoped) in which he surveys fifty Blogger blogs and discovers that 60% of them are spam! I expected the proportion to be high, but not this high; frankly, I’m flabbergasted. If we venture to extrapolate this figure, it would mean that of the 32,700,000 pages Google claims to have indexed on the blogspot.com domain (Philipp says 7,500,000, but this search gives me a lot more), around 20 million would be spam.
Google seems to have realised that it was shooting itself in the foot with this affair, and apparently measures have been taken. At the end of August, Blogger added a “Flag?” button to the navigation bar that (usually) appears at the top of each blog, allowing visitors to report sites that seem to be spam.
This button seems suspect to me, for two reasons. Firstly, it allows for co-ordinated attacks against blogs that might upset a certain group or community... which makes me shudder just to think about it [thanks to Nathan Weinberg for the link]. But, more importantly, the button is completely useless, since it’s a simple task to remove the Blogger navigation bar altogether (as I’ve done here on this blog, by way of demonstration!). Sometimes I wonder... Google and Blogger pay their researchers and engineers a lot of money to come up with this sort of thing. It never ceases to amaze me.
But more seriously, Blogger (which must have some good engineers as well) seems to have put in place an effective anti-splog filtering system. Island Dave points out that when you click on Blogger’s “Next Blog” button, you no longer land on a spam-filled page. This is confirmed by Blogger, who claim to have “put some Artificial Intelligence to work”, no less!
As far as Artificial Intelligence is concerned, the procedures for detecting spam are quite well known. Here’s one, for instance, that I use in my classes to explain some basic notions about the distribution of words in texts, Zipf’s law (which I will no doubt return to one day), and so on.
Take a text, any text. The Little Prince, for instance (and don’t bother looking, it’s not on the Web because it’s not out of copyright). Calculate the number of words. Hang on, there’s a problem with the ambiguity of the word word... Does the sentence “The Little Prince draws the little sheep” contain 7 words or 5? Both, my dear Watson! There are 7 words separated by spaces, but only 5 different words. To distinguish the two, we speak of tokens in the first case and types in the second: 7 tokens, 5 types.
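The token/type distinction above is two lines of code in any language. A minimal sketch in Python (folding case, so that “The” and “the” count as one type):

```python
# Count tokens (all word occurrences) and types (distinct words).
sentence = "The Little Prince draws the little sheep"
tokens = sentence.lower().split()  # case-folded, split on spaces
types = set(tokens)                # duplicates collapse into one entry

print(len(tokens))  # 7 tokens
print(len(types))   # 5 types
```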
Now that we’ve cleared up this matter of words, let’s get to work. Using, for example, my (free) program Dico, we can see that The Little Prince [the original French version] contains 15,352 tokens and only 2,412 types. This gives a type/token ratio of 0.16. Now look at the cooking splog I used as an example at the beginning of this post: it has a type/token ratio of just 0.015, ten times less! Why? It’s quite simple really. The splog in question repeats the same words over and over again, so its vocabulary is much poorer than you would expect to find on a normal blog... It’s slightly more complicated than this, since the type/token ratio tends to decrease as texts get longer. Consequently, certain corrective measures need to be taken, but I’ll spare you the details.
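Dico itself isn’t shown here, but the ratio is easy to sketch in Python. As one illustration of the kind of corrective measure just mentioned (not necessarily the one Dico uses), the ratio can be averaged over fixed-size chunks, which largely removes its dependence on text length; the texts below are made up for the example:

```python
def type_token_ratio(words):
    """Plain type/token ratio: distinct words / total words."""
    return len(set(words)) / len(words)

def mean_segmental_ttr(words, window=100):
    """Average the ratio over fixed-size chunks so the measure no longer
    shrinks simply because the text is long (one standard correction)."""
    chunks = [words[i:i + window]
              for i in range(0, len(words) - window + 1, window)]
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

# A repetitive, splog-like text versus a text with a varied vocabulary.
spammy = ("buy cheap pills online buy cheap pills now " * 50).split()
varied = [f"word{i % 400}" for i in range(1000)]

print(type_token_ratio(spammy))  # 0.0125 -- splog territory
print(type_token_ratio(varied))  # 0.4    -- normal-prose territory
```

The order-of-magnitude gap between the two ratios is exactly the contrast between The Little Prince (0.16) and the cooking splog (0.015).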
I looked at Philipp’s 50 addresses in order to check how effective this strategy, banal as it may seem, really was. So I copied the homepage of each of the 50 blogs, converted it into text, chopped the text files up into words, and calculated the number of tokens and types and the famous type/token ratio. Don’t worry, I have tools that do all that for me! There was one blog which Philipp had put in the wrong category, so I corrected that, and I kept only those pages containing at least 50 words, which was most of them (below this amount, the calculation doesn’t make much sense!).
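My own tools aren’t shown in the post, but the “convert to text, chop into words” steps can be approximated with a few lines of Python (a crude tag-stripping regex, not a full HTML parser; the page below is invented for the example):

```python
import re

def page_to_words(html):
    """Very rough 'convert to text, chop into words': strip tags,
    then keep lowercase alphabetic runs (plus apostrophes) as tokens."""
    text = re.sub(r"<[^>]+>", " ", html)   # replace every tag with a space
    return re.findall(r"[a-z']+", text.lower())

html = "<p>Cheap pills! <a href='http://x.biz'>cheap pills</a> cheap pills</p>"
words = page_to_words(html)
print(len(words), len(set(words)))  # 6 tokens, 2 types
```

For French pages the character class would need accented letters, but the principle is the same.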
Here are the results. I put the number of tokens and the type/token ratio for each of the pages in a graph. Normal blogs are in blue, splogs are in pink.
We can see how the “normal” blogs are nicely concentrated in the cyan ellipse. Most of the splogs are completely out in space, with very low type/token ratios. Only 7 or 8 splogs are badly categorised and fall within the zone of normal blogs. Not bad for a strategy that even a first-year student could have come up with!
So, where’s the artificial intelligence in all this? It’s true that you have to combine a range of criteria, but still, calling this artificial intelligence is a bit much, in my opinion. For example, the distribution of outgoing links needs to be taken into account: if most of them point to the same site, something’s probably up. The number of incoming links is also an indicator: if there are a great many of them, coming from very diverse sites, it’s almost certainly not a splog. And so on. Dealing with spam is very much a game of cat and mouse. Spammers, who always prefer to put in the bare minimum of effort, start out simple, but the anti-spammers quickly update their defences. So the spammers have to adapt, and so it goes on.
It’s worth looking at the splogs that slip through my test and fall inside the cyan ellipse. I don’t want to give them any publicity, so I haven’t made these links clickable.
- decor-home.blogspot.com
- meds4u.blogspot.com
- camouflagec54.blogspot.com
- bangg0e.blogspot.com
- digitalaudiocfd.blogspot.com
- mlb-daily.blogspot.com
- physicianemploymentpwt.blogspot.com
An important characteristic of these sites is that they make use of extracts from real texts, such as news clips (and they also have a variety of outgoing links). I had to look at them several times before I could tell whether they were really spam, and for some of them I’m still not totally convinced. After all, blogs may well exist that collect news items in a given domain (even for commercial purposes), small ads, sports results, etc. It seems difficult to draw the line between sites which are worthless, useless or commercial (but nonetheless legitimate) on the one hand, and splogs on the other. So in the end, yes, it does take intelligence to do a good job in this area, and those who may well end up paying the price are poetic, experimental and marginal blogs that don’t meet the criteria of normal text. Imagine what Blogger’s or Google’s Artificial Intelligence would make of an Oulipian poetry site, for instance. But that is surely the price we will have to pay if we don’t want the Web to turn into an immense public dumping ground.
Follow up
10 Comments:
Hello,
you suggest 4 tests for incoming links on your site, but you forgot to mention that there is a 5th one permanently present on your site: “ils en parlent... liens entrants” [“they’re talking about it... incoming links”].
Result: 46,039 results...
What I find amusing is that you are more precise in this new query, since you added the protocol (http://), and yet Yahoo! still returns more results.
Usually, the more precise you are, the fewer results you get. Any explanation...?
Hello,
a quick word on Google’s link command.
It is conspicuously non-exhaustive. Google has admitted as much.
See for example this link on Abondance:
http://docs.abondance.com/question85.html
Regards,
MBt> Actually, it seems that whether you include http:// or not, it returns the same thing. The query in the post itself is also returning 46,039 results right now... The query restricted to the homepage has risen to 42,449! So either Yahoo is in the middle of updating its database (in fact, I think they index continuously), or we are landing on “data centers” in slightly different states... To be continued, in any case!
Loran> Thanks for the link (here it is in clickable form: http://docs.abondance.com/question85.html).
I have indeed seen discussions saying that Google only gives a sample of its backlinks. I admit I don’t really understand why they would do that. That they limit (like Yahoo) the list of visible URLs to 1,000, I understand perfectly well, but that they don’t give the real count they have in the index is less clear. Perhaps technical constraints due to the way the index is organised? Strange, all the same...
Let’s remember that in French you put a space between the end of a word and a question mark or exclamation mark. I don’t know whether this blog removes them automatically, because I don’t see any.
Yannick> I know, and I did that at first, but as you know it has to be a non-breaking space, otherwise you regularly end up with ! ? : at the beginning of a line. And Blogger automatically turns the entities into plain spaces.
So, of two evils I chose the lesser and opted to drop the spaces. Not great, but the Web in general is an offence to fine typography...
A small typo... Nothing very important:
“je ne suis pas sûr qu'elle améliorent la qualité globale du moteur!” should read “qu'elleS améliorent”.
typo> Thanks! That improves the overall quality of the blog ;-)
Hello.
“However, what’s new is that you can distinguish the links that point to the strict URL of a site’s homepage from those that point to any page of the site”
In fact, Yahoo!’s linkdomain: command already made it possible to display the links pointing to an entire site (cf http://influx.joueb.com/news/247.shtml)
Christophe> Yes, but linkdomain doesn’t allow you to restrict the display to a sub-site such as www.up.univ-mrs.fr/veronis