Jean Véronis
Aix-en-Provence
(France)



Wednesday, September 24, 2008

Google: Please find attached...


I’ve dreamt about it (and I’m sure you have too), and Google has done it (in part at least)... How many times have you sent a message and realized afterwards that you forgot the attachment? Embarrassment guaranteed. It has almost become a standing joke of mine that automatic detection of missing attachments would be one of the best-selling natural language processing programs in the world. A few years ago I even discussed with students in my seminars the various ways of developing such a function.

Well, believe it or not, Google has announced that it has developed this function as part of Gmail, under the mildly sexy name of "Forgotten attachment detector".



It must seem slightly magical to some of you, almost the stuff of science fiction (could Google now guess, or even anticipate, our thoughts? It’s enough to make you shiver...). I am the first to denounce false announcements, which do more harm than good in the field of language technologies (there has been a slew of them over the last half century or more, on machine translation, man-machine dialogue, and others). We know how hard these technologies are, and the greatest modesty is still in order. As I say in my first lesson, in fifty years we have managed to decode the human genome, but not language... In this particular case, however, I do believe it’s perfectly feasible.




How on earth has Google managed to do it? Honestly, I have no idea, but I can tell you how I would have done it (and it seems to me to be just about the only way). The wrong way, in my opinion, is to scratch your head and try to come up with expressions to detect in the body of emails: "please find attached", etc. Even if you hire the best linguists in the world, most of the expressions people actually use will more than likely be missed.

So here’s my recipe:
  • Take a very large mail base, millions, billions if possible (Google easily has that).
  • Split them into two piles: mails with attachments, mails without attachments.
  • Extract from each pile the dictionary of words that come up or, even better, the n-grams, that is, the sequences of n words that come up.
  • With the use of statistical tools, extract the n-grams which appear frequently in mails with attachments and not in mails without attachments.
  • For each new mail, check to see if one of these magical n-grams is present in the text, and if so trigger an alarm.
I’ve just done a little rough test with my own mails and I can see word sequences appearing like: "hereafter", "attached file(s)", "attachment(s)", "I’m sending you", "I’m forwarding to you", "here is the report", "here is the file", "here is the/a document", "here is the estimate", "please find", etc.
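To make the recipe concrete, here is a minimal sketch in Python. It is only an illustration of the steps listed above, not Google's method (which I do not know). The function names (`ngrams`, `learn_cues`, `should_warn`), the smoothed log-ratio score, and the thresholds `min_count` and `min_log_ratio` are all my own assumptions; any reasonable statistical association measure could replace the log-ratio.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=3):
    """Return the set of word n-grams (n = 1..n_max) found in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        " ".join(words[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(words) - n + 1)
    }


def learn_cues(with_att, without_att, min_count=2, min_log_ratio=1.0):
    """Keep n-grams that are much more frequent in mails with attachments.

    Both thresholds are illustrative assumptions, to be tuned on real data.
    """
    # Document frequency of each n-gram in each pile.
    pos = Counter(g for mail in with_att for g in ngrams(mail))
    neg = Counter(g for mail in without_att for g in ngrams(mail))
    cues = set()
    for gram, count in pos.items():
        if count < min_count:
            continue
        # Add-one smoothing avoids dividing by zero for unseen n-grams.
        p_pos = (count + 1) / (len(with_att) + 2)
        p_neg = (neg[gram] + 1) / (len(without_att) + 2)
        if math.log(p_pos / p_neg) >= min_log_ratio:
            cues.add(gram)
    return cues


def should_warn(body, has_attachment, cues):
    """Raise the alarm when a cue n-gram is present but nothing is attached."""
    return not has_attachment and not cues.isdisjoint(ngrams(body))
```

With a few million mails in each pile instead of a toy list, `learn_cues` would be run once offline, and `should_warn` would be called when the user clicks "Send".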

Of course, a program like this will generate a little noise (false alerts) and silence (missed attachments), but if 95% of cases can be detected, it’s a more than useful function.
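Noise and silence can be measured on a hand-labelled sample of mails. The sketch below assumes two parallel lists of booleans, which are hypothetical inputs of my own: `predicted` (did the detector raise an alert?) and `actual` (was an attachment really forgotten?). Noise is then the share of alerts that were false, and silence the share of forgotten attachments that went undetected (one minus precision and one minus recall, respectively).

```python
def noise_and_silence(predicted, actual):
    """Return (noise, silence) for a labelled sample of mails.

    predicted[i] is True when the detector raised an alert for mail i;
    actual[i] is True when an attachment really was forgotten.
    """
    pairs = list(zip(predicted, actual))
    false_alerts = sum(1 for p, a in pairs if p and not a)
    misses = sum(1 for p, a in pairs if a and not p)
    alerts = sum(1 for p, _ in pairs if p)
    forgotten = sum(1 for _, a in pairs if a)
    # Guard against empty denominators on tiny samples.
    noise = false_alerts / alerts if alerts else 0.0
    silence = misses / forgotten if forgotten else 0.0
    return noise, silence
```

Tightening the detection threshold typically lowers noise at the cost of more silence, so both figures need to be watched together.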

My estimate:
  • Building a prototype: one day.
  • Developing and testing an operational version: one month.
Maybe I should offer my services to Google since, if I am to believe the mini-test featured on Pulse 2.0, the detector is not very good. It recognizes "I have attached", but not "Attach a document" or "Here is the attachment"... I tested this myself, with phrases like "Attached please find a copy of...", without much more success. Rather strange all the same.

It remains to be seen (once these few details have been resolved...) whether Google will offer a French version. I’ve already mentioned in the past how long Google takes to localize its products: sometimes a few years. Watch this space.
