Jean Véronis


Friday, September 09, 2005

Web: Google, Blogger and splogs

Splogs (a newly-coined word made up of spam + blog) are to blogs what spam is to email… Annoying little things designed to sell you Viagra or a whole host of other, equally suspect, services. How they work is quite simple: you open a free blog (or hundreds of them) which you stuff full of dummy text and – most importantly – links to the real site where you plan to sell us something (or, more probably, rip us off). Then all you have to do is wait until Google comes calling and, since Google is very good at indexing spam ;-), the customers will soon start pouring in...

Here’s a typical example (the title of each post is a link to a .biz site):

Blogger is obviously a major source of splogging. It's free, easy to set up and fill using automated procedures, and well indexed by Google (see here [fr]) – when you learn that Blogger is a subsidiary of Google, you may well wonder whether Google isn't giving Blogger a little helping hand here (just compare the positioning of sites on Google with those on Yahoo or MSN). But the great paradox is that, in doing this, Google is polluting itself by generously indexing the splogs generated on Blogger...

I’ve just read (a little late, I admit) an extremely interesting post by Philip Lenssen (Google Blogoscoped) in which he carries out a survey of fifty Blogger blogs and discovers that 60% of them are spam! I expected the proportion to be high, but not this high – frankly, I’m flabbergasted. If we venture to extrapolate this figure, it would mean that of the 32,700,000 pages Google claims to have indexed on the domain (Philip says 7,500,000, but this search gives me a lot more), more than 20 million would be spam.

Google seems to have realised that it was shooting itself in the foot with this affair, and apparently measures have been taken. At the end of August, Blogger added a “Flag?” button to the navigation bar that (usually) appears at the top of each blog, allowing visitors to report sites that seem to be spam.

This button seems suspect to me, for two reasons. Firstly, it allows for co-ordinated attacks against blogs that might upset a certain group or community... which makes me shudder just to think about it [thanks to Nathan Weinberg for the link]. But, more importantly, this button is completely useless, since it’s a simple task to just remove the Blogger navigation bar altogether (as I’ve done here on this blog by way of demonstration!). Sometimes I wonder … Google and Blogger pay their researchers and engineers a lot of money to come up with this sort of thing. It never ceases to amaze me.

But more seriously, Blogger (who must have some good engineers as well) seems to have put in place an effective anti-splog filtering system. Island Dave points out that when you click on Blogger’s “Next Blog” button, you no longer land on a spam-filled page. This is confirmed by Blogger, who claims to have “put some Artificial Intelligence to work”, no less!

As far as Artificial Intelligence is concerned, the procedures for detecting spam are quite well known. Here’s one, for instance, that I use in my classes to explain some basic notions about the distribution of words in texts, Zipf’s law (which I will no doubt return to one day), etc.

Take a text, any text. The Little Prince, for instance (and don’t bother looking, it’s not on the Web because it’s not out of copyright). Calculate the number of words. Hang on, there’s a problem with the ambiguity of the word word … Does the sentence “The Little Prince draws the little sheep” contain 7 words or 5? Why both, my dear Watson! There are 7 words separated by spaces, but only 5 different words. To differentiate between the two, we talk of tokens in the first case, and types in the second: 7 tokens, 5 types.
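The token/type distinction above can be shown in a few lines of Python (this is just an illustrative sketch, not the author's Dico program):

```python
# Tokens vs. types on the example sentence from the text.
sentence = "The Little Prince draws the little sheep"

tokens = sentence.lower().split()  # every space-separated word, case-folded
types = set(tokens)                # only the distinct word forms

print(len(tokens))  # 7 tokens
print(len(types))   # 5 types: the, little, prince, draws, sheep
```

Case-folding matters here: without `lower()`, "The" and "the" would count as two different types, giving 6 instead of 5.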

Now that we’ve cleared up this matter of words, let’s get to work. Using, for example, my (free) program Dico, we can see that The Little Prince [the original French version] contains 15,352 tokens and only 2,412 types. This gives a type/token ratio of 0.16. Now let’s look at the cooking splog I used as an example at the beginning of this post. It has a type/token ratio of just 0.015 – ten times less! Why? It’s quite simple really. The splog in question repeats the same words over and over again, so its vocabulary is much poorer than you would expect to find on a normal blog… It’s slightly more complicated than this, since the type/token ratio tends to decrease with the size of the texts. Consequently, certain corrective measures need to be taken, but I’ll spare you the details.
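The ratio itself is a one-liner. Here is a minimal sketch (the two texts below are invented miniatures for illustration, not the actual Little Prince or the cooking splog); the length correction at the end is one classic option, Guiraud's index, which the post alludes to without detailing:

```python
import math

def type_token_ratio(text):
    """Distinct word forms divided by total word count."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

# Invented miniature texts: one with varied vocabulary, one repetitive.
normal = "the little prince asked the fox to draw him a sheep under the stars"
sploggy = "cheap viagra cheap viagra buy cheap viagra buy viagra cheap cheap"

print(type_token_ratio(normal) > type_token_ratio(sploggy))  # True

def guiraud(text):
    """Guiraud's index: types / sqrt(tokens), less sensitive to text length."""
    tokens = text.lower().split()
    return len(set(tokens)) / math.sqrt(len(tokens))
```

The raw ratio shrinks as texts grow (common words keep repeating while new types become rarer), which is why a size correction like Guiraud's is needed before comparing texts of very different lengths.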

I looked at Philip’s 50 addresses in order to check how effective this strategy, banal as it may seem, really was. So I copied the homepages of each of the 50 blogs, converted them into text, chopped the text files up into words, and calculated the number of tokens and types and the famous type/token ratio. Don’t worry, I have tools that do all that for me! There was one blog which Philip had put in the wrong category, so I corrected that, and I only kept those pages that contained at least 50 words, which was most of them (below this amount, my calculation doesn’t really make much sense!).
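The pipeline described above – fetch a homepage, strip the HTML to text, tokenise, count – can be sketched as follows. This is an assumed reconstruction using the Python standard library, not the author's actual tools; the 50-word cutoff matches the one in the text:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

def page_stats(html):
    """Return (tokens, types, type/token ratio), or None if under 50 words."""
    parser = TextExtractor()
    parser.feed(html)
    # Crude tokenisation: runs of letters (including French accents).
    words = re.findall(r"[a-zàâçéèêëîïôûùüÿ']+", " ".join(parser.chunks).lower())
    if len(words) < 50:   # below this, the ratio doesn't mean much
        return None
    return len(words), len(set(words)), len(set(words)) / len(words)
```

Fetching the homepages themselves could be done with `urllib.request.urlopen` before calling `page_stats`; that step is omitted here.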

Here are the results. I put the number of tokens and the type/token ratio for each of the pages in a graph. Normal blogs are in blue, splogs are in pink.

We can see how the “normal” blogs are nicely concentrated in the cyan ellipse. Most of the splogs are completely out in space, with very low type/token ratios. Only 7 or 8 splogs are badly categorised and fall within the zone of normal blogs. Not bad for a strategy that even a first-year student could have come up with!

So, where’s the artificial intelligence in all this? It’s true that you have to mix a range of criteria, but still – calling this artificial intelligence is a bit much, in my opinion. For example, the distribution of outgoing links needs to be taken into account. If most of them point to the same site, something’s probably up. The number of incoming links is also an indicator: if there are a whole lot of them, and they come from very diverse sites, it is undoubtedly not a blog. And so on. Dealing with spam is very much a game of cat and mouse. Spammers, who always prefer to put in a bare minimum of effort, do things simply at first, but the anti-spammers quickly update their defences. So the spammers have to adapt, and so it goes on.
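Combining the criteria mentioned above could look something like this. Everything here is invented for illustration – the thresholds, the weights, and the two-signal scoring are assumptions, not Blogger's actual filter:

```python
from collections import Counter

def splog_score(ttr, outgoing_hosts):
    """Toy score in [0, 1]; higher means more splog-like.

    ttr             -- type/token ratio of the page text
    outgoing_hosts  -- list of hostnames the page links out to
    """
    score = 0.0
    if ttr < 0.05:  # vocabulary far poorer than a normal blog's
        score += 0.5
    counts = Counter(outgoing_hosts)
    if counts:
        top_share = counts.most_common(1)[0][1] / len(outgoing_hosts)
        if top_share > 0.8:  # nearly all links point to one site
            score += 0.5
    return score
```

A real system would combine many more signals (incoming links, posting frequency, and so on) and weight them statistically rather than with hand-picked cutoffs.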

It’s worth looking at the blogs that passed my test and fall inside the cyan ellipse. I don’t want to give them any publicity, so I haven’t made these links clickable.
An important characteristic of these sites is that they make use of extracts from real texts, such as news clips (and they also have a variety of outgoing links). I had to look at them several times before I could tell if they were really spam, and for some of them I’m still not totally convinced. After all, blogs may well exist that collect news items in a given domain (even for commercial purposes), small ads, sports results, etc. It seems to me to be difficult to draw the line between sites which are worthless, useless or commercial (but nonetheless legitimate) on the one hand, and splogs on the other. So in the end, yes, it does take intelligence to do a good job in this area, and those who may well end up paying the price are poetic, experimental and marginal blogs that don’t meet the criteria of normal text. Imagine what Blogger or Google’s Artificial Intelligence would make of an Oulipian poetry site, for instance. But that is surely the price we will have to pay if we don’t want the Web to turn into an immense public dumping ground.

Follow up

2 Comments:

Anonymous wrote...

Very nice approach to determining splogs, but as you may notice in the graph, there are many splogs concentrated with the normal blogs.

I have already tried a similar approach to detect splogs, but there were many considerations that made me change my mind and turn to some artificial intelligence calculation, especially the speed of the detection algorithm.

I have an implementation that you may see at, I would love to have your opinion on it.

September 14, 2005, 22:56  
Jean Véronis wrote...

Hatem> Thanks for your message!

1. Very nice approach ... but
Please note that this is only a small experiment to serve as a sort of tutorial on the type/token ratio. As I said in the post, in a real system many sources of information should be used and combined (for example with a Bayesian strategy).

2. Speed of algorithm
I agree that this is a concern (congratulations, your site responds very quickly!). However, the type/token ratio computation is not costly. I assume that you perform some kind of tokenisation anyway inside your program; that is the only costly part.

See my post.

Good luck!

September 15, 2005, 10:07  
