Jean Véronis
Aix-en-Provence
(France)



Wednesday, January 11, 2006

Translation: Systran or Reverso?


Linguists consider it a matter of faith to poke fun at machine translations. It is true that they often provide us with a veritable anthology of badly-constructed sentences and meaningless phrases that can border on the surreal. But the earliest research in machine translation dates from the beginning of the 1950s: more than half a century of effort has not been enough for us to succeed in cracking the code. A sign of the inherent difficulties of language, perhaps? In the same period of time, we have managed to decipher the human genome (the discovery of the double helix structure of DNA in 1953 dates from around the same time as the early days of machine translation)...



Still, progress is being made – too slowly for my liking, of course, but we mustn’t be unfair. If machine translation cannot compete with a human translator (even a bad one!), that doesn’t necessarily mean that it is completely without interest. Here’s a little experiment that I give to my students each year in my introductory course to Automatic Language Processing. Let’s take one of the day’s top stories in a Greek newspaper, Kathimerini:

Δύο νέα κρούσματα στην Τουρκία του θανατηφόρου ιού της γρίπης των πτηνών

Ο Παγκόσμιος Οργανισμός Υγείας ανακοίνωσε σήμερα στη Γενεύη ότι δυο παιδιά που νοσηλεύονται στην Τουρκία έχουν προσβληθεί από το θανατηφόρο στέλεχος Η5Ν1 του ιού της γρίπης των πτηνών.

Εκπρόσωπος του Οργανισμού δήλωσε ότι τα παιδιά, ηλικίας 5 και 8 ετών, προέρχονται από την ίδια περιοχή με τα τρία αδέλφια που πέθαναν από τη γρίπη των πτηνών αυτήν την εβδομάδα.

Σημειώνεται πως 32 άτομα νοσηλεύονται σε νοσοκομείο της πόλης Βαν με ύποπτα συμπτώματα, ενώ τουλάχιστον πέντε περιοχές της ανατολικής Τουρκίας έχουν τεθεί σε καραντίνα.


Probably doesn’t mean a lot to you, does it? I always deliberately choose a language that few people are likely to know. Greek is perfect because we can’t even hazard a guess at what the text is about from the form of the words (whereas we can decipher English, Spanish or German even if we don’t speak the language). Chinese or Japanese would also make good candidates!

Let’s compare this with the version translated by Babelfish:

Two new cases in Turkey of leathal virus of flu of birds

The World Organism of Health announced today in Geneva that two children that nosiley'ontaj in Turkey they have been offended by leathal executive I5N1 of virus of flu of birds.

Representative of Organism declared that the children, age 5 and 8 years, emanate from the same region with the three brothers that died from the flu of birds this week.

It is marked that 32 individuals nosiley'ontaj in hospital of city Van with suspect symptoms, while at least five regions of Eastern Turkey have been placed in quarantine.

This translation is a perfect example of the state of the art in the field. We can understand the general subject matter (bird flu in Turkey), and we can even list the main facts: two children aged 5 and 8 have been infected with bird flu in Turkey, 32 people have been hospitalised with suspicious symptoms, five regions are under quarantine, etc. Some of the errors are stupid: H5N1 is translated as I5N1, and νοσηλεύονται (hospitalised, being treated) is missing from the dictionary. Things could easily be improved.

On no account should such translations be used as final documents, and I’m always stunned when students (or colleagues!) proudly announce that they have had their abstract machine translated for a conference! But machine translation has reached the point where it can legitimately be used as a tool for deciphering a text, a way of quickly getting to know the subject matter and general content of pages in foreign languages, for those situations where paying a translator would be inconceivable. It is used, for instance, in economic monitoring, and can prove useful for ordinary web surfers as well: although by far the majority of documents on the web are written in English, fewer than 30% of web surfers are English-speakers (according to a study carried out by Byte Level), and this proportion is falling all the time.

It comes as no surprise, then, that most search engines offer the option of translating any pages returned. But with such a considerable potential market, it is quite surprising to see that the offer is so limited: Google and Yahoo both use the same technology, the Systran system, which is also behind Babelfish (Altavista). At first, French search engine Voila used Reverso by Softissimo, before finally opting for Systran as well … Portals like AOL and Wanadoo also offer Systran. Indeed, Systran has Internet operators to thank for the lion’s share of its turnover.

In the midst of such widespread systrannisation, Ask Jeeves recently made the surprising announcement that it is to associate with Reverso [via DSI (fr)], which is also available on the search engine’s French beta version.

Is this a bad choice? In order to find out, we asked 58 students from the first year of our degree course to look into the question. Our project consisted of having students translate a text of their choice, of at least 500 words in length, from their second language into their mother tongue (in order to enable them to correctly judge the quality of the end result), using both Reverso and Systran (on the Babelfish site). Each student then had to deliver a detailed report on the errors and their probable causes (word missing from dictionary, etc.); don’t worry, I’ll spare you the details. The final question asked each student to choose whether Reverso or Systran provided the more readable translation.

The results are quite categorical:

Source    Target    Reverso   Systran
German    French          2         0
English   French         15         5
Italian   French          8         1
Spanish   French         20         6
French    English         0         1
Total                    45        13


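For every source language translated into French, the choice was clear: Reverso. The only vote that went the other way is the single French-to-English translation.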
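For readers who want to check the arithmetic, here is a minimal sketch that tallies the votes from the results above. The counts are copied from the table; the names `votes`, `tally` and `preferred` are only illustrative and are not part of the original study.

```python
# Votes per language pair, copied from the results table: (Reverso, Systran).
votes = {
    ("German", "French"): (2, 0),
    ("English", "French"): (15, 5),
    ("Italian", "French"): (8, 1),
    ("Spanish", "French"): (20, 6),
    ("French", "English"): (0, 1),
}


def tally(votes):
    """Return the total number of votes for (Reverso, Systran)."""
    reverso = sum(r for r, _ in votes.values())
    systran = sum(s for _, s in votes.values())
    return reverso, systran


def preferred(reverso, systran):
    """Name the system with more votes for one language pair."""
    if reverso > systran:
        return "Reverso"
    if systran > reverso:
        return "Systran"
    return "tie"


reverso_total, systran_total = tally(votes)
reverso_share = reverso_total / (reverso_total + systran_total)

for (source, target), (r, s) in votes.items():
    print(f"{source} -> {target}: {preferred(r, s)} ({r} vs {s})")
print(f"Total: Reverso {reverso_total}, Systran {systran_total} "
      f"({reverso_share:.1%} of students preferred Reverso)")
```

The totals add up to the 58 students in the study. With a single vote for French to English, the tally says nothing reliable about that direction.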



So... could this be a smart move on the part of Ask Jeeves? In any case, Systran, which has fallen out with its traditional “cash cow” the European Commission [see Le Monde, Systran (fr)], will have to buck its ideas up if it is to survive in the pitiless world of the Internet operators.

Thanks Estelle for going through the study.

8 Comments:

Blogger Justin wrote...

All but the last example target French. I would like to know if a translator works better in one direction than in another. For example, is it possible that Reverso makes more readable translations into French while another translator does a better job targeting English?

January 16, 2006 09:57
Anonymous wrote...

For a full set of independently written case studies, tips, hints, tricks, and comparison reports concerning both the Reverso (PROMT-based) and SYSTRAN machine translation software packages, please refer to:

The Language Software Evaluation/Review site:
http://www.geocities.com/langtecheval/

The MT Tips site:
http://www.geocities.com/jeffallenpubs/MT-tips.htm

MT Forum
http://www.translators.com
Menu bar: Community > Discussion Forums
Go to Machine Translation Forum

MT user forums on Yahoo Groups
http://groups.yahoo.com/group/Reverso_users/
http://groups.yahoo.com/group/SYSTRAN_users/
http://groups.yahoo.com/group/PROMT_users/

Jeff Allen

January 17, 2006 22:03
Blogger Justin wrote...

Actually, I was only wondering in general. One might think that a translator would do equally well in either direction at least on the lexical level. That is to say if the translator lacks a word on either side it may as well lack it for both. I see no reason however why a translator couldn't be unbalanced semantically and syntactically. (Not that one can really extract any of these categories.)

January 18, 2006 12:25
Blogger Jean Véronis wrote...

Apologies to all of you. I'm pretty far behind in my responses (I've been very busy with the clouds).

Justin, Jeff> My study involved mostly French as a target for obvious reasons of student availability. I have no empirical grounds to assess any kind of symmetry or asymmetry in MT systems. One would have to run the experiment in the reverse direction, which I haven't done. However, knowing a little bit about MT and NL systems in general, I suspect that there are many reasons why we could have asymmetry. One of the reasons is that most language-translation pairs in most systems involve English. Therefore the lexicons, compound detection, grammatical rules, etc. are likely to be better for English. My intuition would be that the general trend is a better analysis when English is the source and a better generation when English is the target. Is this true? How do the two factors combine in practice? I have no means to know without running extensive tests.

January 18, 2006 13:25
Blogger Justin wrote...

Thank you, Jeff.

January 23, 2006 09:32
Blogger Unknown wrote...

Hello Gentlemen,

Sorry for breaking in so late - this discussion just got indexed by my Google News tracker.

Just wanted to let you know that both Voila (tr.voila.fr) and Orange (traduction.orange.fr) are now using the original Promt translation service, so the landscape is becoming a little bit more diversified - at least in France.

Regards,
Nikolay Vasiliev

August 28, 2007 20:34
Anonymous language translation wrote...

Interesting post. It's true that machine translation is becoming increasingly effective, but it needs to start interpreting idioms and understanding the cultural context of a text before it can truly replace human translation.

September 26, 2009 19:39
Blogger Unknown wrote...

Hi Language Translation:

But it is possible to handle idioms and stylistic expressions with various MT software programs. I do it all the time. The objective is to use the MT software as an assistance tool for the human translator. As for cultural aspects, it is possible to handle localization variants within such tools, with varying levels of usability. I have worked in real translation production projects with 30+ versions of MT software (and 2 brand new ones received recently to start trying out), and have used MT to translate a very wide range of topics, domains and document types.
see: http://www.proz.com/post/1268576#1268576

Many people write in forums that MT should/could/would not work, but those words all clearly indicate to me that those people have never really tried it, or they tried with a free online translator rather than a professional or expert desktop system designed for the purpose.
Would you try to use a 1- or 3-speed bicycle to do the Tour de France? Of course not: a 27-speed bike is more appropriate.

I always write can/does/makes in my statements about MT, because I do use it and write case studies about my implementations.

Jeff

January 31, 2010 05:21