Jean Véronis

Thursday, September 15, 2005

Splogs: system

Hatem has left a comment on my post "Google, Blogger and splogs", asking for my opinion about his site. It is an online service, launched a few days ago, that lets you check whether a given URL is likely to be a splog.

As explained here, to use it, you simply send the query:
where the_url_to_check is the URL of the blog that you're trying to check. The service will return:
  • 1 : if the blog is detected as a splog
  • 0 : if not
  • 3 : if the URL does not open because of a DNS error, a 404 error, etc.
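The three return codes above could be handled on the client side along these lines. This is only a minimal Python sketch: the helper name `interpret_splog_code` and the labels are illustrative assumptions, not part of the service, and the query itself is deliberately left out.

```python
# Hypothetical helper: maps the documented return codes to readable labels.
# Only the codes 1, 0 and 3 come from the service description;
# the function name and the label strings are assumptions.
CODE_LABELS = {
    "1": "splog",
    "0": "not a splog",
    "3": "unreachable",  # DNS error, 404 error, etc.
}


def interpret_splog_code(body: str) -> str:
    """Return a readable label for the raw response body."""
    code = body.strip()
    # Anything outside the documented codes is reported, not guessed at.
    return CODE_LABELS.get(code, "unknown response")
```

Treating undocumented values as "unknown response" keeps a client from silently counting an unexpected reply as a clean blog.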
I sent it the set of URLs that I borrowed from Philip Lenssen and used in my previous post (only 42 of them responded this morning). The results are quite impressive:




  • Total correct: 39 (92%)
  • Normal (false positives): 2
  • Spam (false negatives): 1
  • Total wrong: 3 (8%)

A success rate above 90% is remarkable for such a young system, especially since, as I noted before, some of these splogs are quite difficult to tell apart from normal blogs, even for the human eye. Congratulations, then. I'll be following how the system develops with great interest.

If I can give one piece of advice for the future, I would try to decrease the false positive rate (i.e. normal blogs reported as spam). At the moment, this rate is 2/19, i.e. ca. 10% (although of course a precise assessment is difficult on such a small number of URLs). It seems to me quite dangerous to report legitimate blogs as spam, and I would be happier if this rate fell well below 1%, even if the price to pay is letting more splogs through the net.
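For readers who want to redo the arithmetic, here is a small Python sketch of the rates discussed above. It assumes that each of the 42 responding URLs is either a normal blog or a splog, so the number of splogs (23) is derived rather than reported; the variable names are illustrative.

```python
# Counts taken from the results reported in this post.
total = 42             # URLs that responded
normal_blogs = 19      # denominator of the 2/19 false positive rate
false_positives = 2    # normal blogs reported as spam
false_negatives = 1    # splogs reported as normal

# Assumption: every responding URL is either a normal blog or a splog.
splogs = total - normal_blogs

correct = total - false_positives - false_negatives
accuracy = correct / total
false_positive_rate = false_positives / normal_blogs
false_negative_rate = false_negatives / splogs

print(f"correct: {correct}/{total} = {accuracy:.1%}")
print(f"false positive rate: {false_positives}/{normal_blogs} = {false_positive_rate:.1%}")
print(f"false negative rate: {false_negatives}/{splogs} = {false_negative_rate:.1%}")
```

With so few URLs, a single additional misclassified normal blog would move the false positive rate by roughly five percentage points, which is why these figures should be read as rough estimates.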

Of course, spammers monitor all this (see here for instance), and I am pretty sure that they will soon come up with splog-generating software that produces human-looking text, which will be extremely difficult to tell apart from real human writing by automatic means.

Anyway, congratulations again, Hatem, and good luck with your system!

1 Comment:

JoeC wrote...

Some spammers are already creating splogs with human-created text. They just steal text from other sites (Wikipedia being an obvious choice).

But even with actual human-created text, there are still characteristics that splogs do not share with normal blogs. Such splogs are much harder for a human to detect unless you recognize that the text is stolen, but hopefully most of them can be identified from their other spammy characteristics.

September 16, 2005 00:13
