Jean Véronis
Aix-en-Provence
(France)



Wednesday, January 11, 2006

Translation: Systran or Reverso?


Linguists consider it a matter of faith to poke fun at machine translations. It is true that they often provide us with a veritable anthology of badly-constructed sentences and meaningless phrases that can border on the surreal. But the earliest research in machine translation dates from the beginning of the 1950s: more than half a century of effort has not been enough for us to succeed in cracking the code. A sign of the inherent difficulties of language, perhaps? In the same period of time, we have managed to decipher the human genome (the discovery of the double helix structure of DNA in 1953 dates from around the same time as the early days of machine translation)...



Still, progress is being made – too slowly for my liking, of course, but we mustn’t be unfair. If machine translation cannot compete with a human translator (even a bad one!), that doesn’t necessarily mean that it is completely without interest. Here’s a little experiment that I give to my students each year in my introductory course to Automatic Language Processing. Let’s take one of the day’s top stories in a Greek newspaper, Kathimerini:

Δύο νέα κρούσματα στην Τουρκία του θανατηφόρου ιού της γρίπης των πτηνών

Ο Παγκόσμιος Οργανισμός Υγείας ανακοίνωσε σήμερα στη Γενεύη ότι δυο παιδιά που νοσηλεύονται στην Τουρκία έχουν προσβληθεί από το θανατηφόρο στέλεχος Η5Ν1 του ιού της γρίπης των πτηνών.

Εκπρόσωπος του Οργανισμού δήλωσε ότι τα παιδιά, ηλικίας 5 και 8 ετών, προέρχονται από την ίδια περιοχή με τα τρία αδέλφια που πέθαναν από τη γρίπη των πτηνών αυτήν την εβδομάδα.

Σημειώνεται πως 32 άτομα νοσηλεύονται σε νοσοκομείο της πόλης Βαν με ύποπτα συμπτώματα, ενώ τουλάχιστον πέντε περιοχές της ανατολικής Τουρκίας έχουν τεθεί σε καραντίνα.


Probably doesn’t mean a lot to you, does it? I always deliberately choose a language that few people are likely to know. Greek is perfect because we can’t even hazard a guess at what the text is about from the form of the words (whereas we can decipher English, Spanish or German even if we don’t speak the language). Chinese or Japanese would also make good candidates!

Let’s compare this with the version translated by Babelfish:

Two new cases in Turkey of leathal virus of flu of birds

The World Organism of Health announced today in Geneva that two children that nosiley'ontaj in Turkey they have been offended by leathal executive I5N1 of virus of flu of birds.

Representative of Organism declared that the children, age 5 and 8 years, emanate from the same region with the three brothers that died from the flu of birds this week.

It is marked that 32 individuals nosiley'ontaj in hospital of city Van with suspect symptoms, while at least five regions of Eastern Turkey have been placed in quarantine.

This translation is a perfect example of the state of the art in the field. We can understand the general subject matter (bird flu in Turkey), and we can even list the main facts: two children aged 5 and 8 have been infected with bird flu in Turkey, 32 people have been hospitalised with suspicious symptoms, five regions are under quarantine, etc. Some of the errors are stupid: H5N1 is translated as I5N1, νοσηλεύονται (hospitalised, cured) is missing from the dictionary. Things could easily be improved.

On no account should such translations be used as final documents, and I’m always stunned when students (or colleagues!) proudly announce that they have had their abstract machine-translated for a conference! But machine translation has reached the point where it can now legitimately be used as a tool for deciphering a text, a way of quickly getting to know the subject matter and general content of pages in foreign languages, for those situations where paying a translator would be inconceivable. It is used, for instance, in economic monitoring, and can prove useful for ordinary web surfers as well: although by far the majority of documents on the web are written in English, less than 30% of web surfers are English-speakers (according to a study carried out by Byte Level), and this proportion is falling all the time.

It comes as no surprise, then, that most search engines offer the option of translating any pages returned. But with such a considerable potential market, it is quite surprising to see that the offer is so limited: Google and Yahoo both use the same technology, the Systran system, which is also behind Babelfish (Altavista). At first, French search engine Voila used Reverso by Softissimo, before finally opting for Systran as well … Portals like AOL and Wanadoo also offer Systran. Indeed, Systran has Internet operators to thank for the lion’s share of its turnover.

In the midst of such widespread systrannisation, Ask Jeeves recently made the surprising announcement that it is to associate with Reverso [via DSI (fr)], which is also available on the search engine’s French beta version.

Is this a bad choice? In order to find out, we asked 58 students from the first year of our degree course to look into the question. Our project consisted of having students machine-translate a text of their choice, of at least 500 words in length, from their second language into their mother tongue (in order to enable them to correctly judge the quality of the end result), using both Reverso and Systran (on the Babelfish site). Each student then had to deliver a detailed report on the errors and their probable causes (word missing from dictionary, etc); don’t worry, I’ll spare you the details. The final question asked each student to choose whether it was Reverso or Systran that provided the more readable translation.

The results are quite categorical:

Source    Target    Reverso   Systran
German    French    2         0
English   French    15        5
Italian   French    8         1
Spanish   French    20        6
French    English   0         1
Total               45        13


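For every language pair with French as the target, the choice was clear: Reverso. Only the single French-to-English comparison went to Systran, and one vote is too few to conclude anything.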
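As a quick sanity check on the table, the votes can be tallied in a few lines of Python. This is only an illustrative sketch: the counts are copied from the table above, while the names `votes`, `totals` and `reverso_share` are made up for the example.

```python
# Votes per language pair, copied from the table above: (Reverso, Systran).
votes = {
    ("German", "French"): (2, 0),
    ("English", "French"): (15, 5),
    ("Italian", "French"): (8, 1),
    ("Spanish", "French"): (20, 6),
    ("French", "English"): (0, 1),
}


def totals(votes):
    """Sum the Reverso and Systran votes over all language pairs."""
    reverso = sum(r for r, _ in votes.values())
    systran = sum(s for _, s in votes.values())
    return reverso, systran


def reverso_share(reverso, systran):
    """Fraction of all votes that went to Reverso."""
    return reverso / (reverso + systran)


reverso_total, systran_total = totals(votes)
for (source, target), (r, s) in votes.items():
    print(f"{source} -> {target}: Reverso {r}, Systran {s}")
print(f"Total: Reverso {reverso_total}, Systran {systran_total}")
print(f"Reverso share: {reverso_share(reverso_total, systran_total):.0%}")
```

With these counts, Reverso collects 45 of the 58 votes, roughly 78%.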



So... could this be a smart move on the part of Ask Jeeves? In any case, Systran, which has fallen out with its traditional “cash cow”, the European Commission [see Le Monde, Systran (fr)], will have to buck its ideas up if it is to survive in the pitiless world of the Internet operators.

Thanks Estelle for going through the study.

13 Comments:

Blogger Justin wrote...

All but the last example target French. I would like to know if a translator works better in one direction than in another. For example, is it possible that Reverso makes more readable translations into French while another translator does a better job targeting English?

16 January, 2006 09:57
Anonymous wrote...

For a full set of independently written case studies, tips, hints, tricks, and comparison reports concerning both the Reverso (PROMT-based) and SYSTRAN machine translation software packages, please refer to:

The Language Software Evaluation/Review site:
http://www.geocities.com/langtecheval/

The MT Tips site:
http://www.geocities.com/jeffallenpubs/MT-tips.htm

MT Forum
http://www.translators.com
Menu bar: Community > Discussion Forums
Go to Machine Translation Forum

MT user forums on Yahoo Groups
http://groups.yahoo.com/group/Reverso_users/
http://groups.yahoo.com/group/SYSTRAN_users/
http://groups.yahoo.com/group/PROMT_users/

Jeff Allen

17 January, 2006 22:03
Blogger mtpostediting wrote...

It is very important to make a distinction between the use of Machine Translation for Inbound translation (content gisting) and the use of it for Outbound translation (translation for publication), the latter being what professional translators do for a living. It is important to note that this article on SYSTRAN and Reverso focuses on the Inbound translation-based MT systems. However, I have always trained professional translators on Outbound translation-featured MT commercial software packages and customized industry-built MT software. And I discourage the use of free online MT systems for any type of Outbound translation activity. A parallel is: Why use Windows Notepad to write a Master's thesis when you can purchase MS Office with a student discount or install OpenOffice?

A 10-15 page PowerPoint presentation on the topic of Inbound versus Outbound translation, with an indication of each module in several MT software packages (including Reverso and SYSTRAN) per their use for Inbound or Outbound purposes, is available at:
http://www.translatorscafe.com/cafe/article50.htm

Also at that page is a 1-page article entitled "Thinking about Machine Translation" which provides short answers to some key questions over the past 15 years on the debate of MT.

Lastly, there is a step-by-step how-to document entitled "Getting Started with Machine Translation" which shows how to transition from the free online MT systems to the packaged software applications which contain translation-friendly features.

Jeff Allen

17 January, 2006 22:31
Blogger mtpostediting wrote...

justin barker wrote...
>>All but the last example target French. I would like to know if a translator works better in one direction than in another. For example, is it possible that Reverso makes more readable translations into French while another translator does a better job targeting English?

For a professional translator in either direction, they should not use the Inbound-only push-button online MT systems. They should purchase an MT software package which contains translator-friendly features for translation productivity. Any serious use of MT by a professional translator must include a combination of user dictionary building and MT postediting.

2 case studies on MT dictionary building in translation production contexts:

http://www.geocities.com/jeffallenpubs/Allen-LI-article-Reverso.pdf

http://www.geocities.com/mtpostediting/Jeff-Allen-AMTA2004-paper_v1.01.pdf

and all about MT postediting at the site that is dedicated to this topic:
http://www.geocities.com/mtpostediting/

Jeff Allen
Certified MT Dictionary developer

17 January, 2006 22:46
Blogger Justin wrote...

Actually, I was only wondering in general. One might think that a translator would do equally well in either direction at least on the lexical level. That is to say if the translator lacks a word on either side it may as well lack it for both. I see no reason however why a translator couldn't be unbalanced semantically and syntactically. (Not that one can really extract any of these categories.)

18 January, 2006 12:25
Blogger Jean Véronis wrote...

Apologies to all of you. I'm pretty far behind in my responses (I've been very busy with the clouds).

Justin, Jeff> My study involved mostly French as a target for obvious reasons of student availability. I have no empirical grounds to assess any kind of symmetry or asymmetry in MT systems. One would have to run the experiment in the reverse direction, which I haven't done. However, knowing a little bit about MT and NL systems in general, I suspect that there are many reasons why we could have asymmetry. One of the reasons is that most language-translation pairs in most systems involve English. Therefore the lexicons, compound detection, grammatical rules, etc. are likely to be better for English. My intuition would be that the general trend is a better analysis when English is the source and a better generation when English is the target. Is this true? How do the two factors combine in practice? I have no means to know without running extensive tests.

18 January, 2006 13:25
Blogger mtpostediting wrote...

18 January, 2006, justin barker wrote:
Actually, I was only wondering in general. One might think that a translator would do equally well in either direction at least on the lexical level. That is to say if the translator lacks a word on either side it may as well lack it for both.


Justin, I'm going to interpret your use of the word "translator" to be a "machine translation system/software program". It could also be understood to be a human translator, and that merits discussion with regard to your questions as well, but this context seems to be focused on the MT software/systems such as SYSTRAN and Reverso.

MT systems are not necessarily equal bi-directionally at a lexical level in their general dictionaries for a few reasons.

1) the misnomer of 1-to-1 translation. Although it would be easier to process language if each term had a single exact matching equivalent in the target language, this is often not the case. Multi-referential equivalents are very common. When I was a technical writing and translation trainer at Caterpillar, we came across many examples where the same term is said several ways in the same text. It is possible to say "radiator cap", "filler cap", "radiator filler cap", "radiator's filler cap", not to mention specific types like a "locking radiator cap" and even a "thermally locking radiator cap".

2) Overlapping meaning for the same term: what do you do when "filler cap" can be used in different contexts: "radiator filler cap", "oil filler cap", "engine oil filler cap", "transmission oil filler cap", etc.? Same problem with "filter", which can be used in different contexts: "air filter" & "engine air filter", "oil filter" & "engine oil filter", "gasoline filter", etc. If only one translation variant is coded for "filler cap" and "filter", imagine the mess when, in an instruction about replacing radiator fluid, the person is told to release the pressure in the radiator filler cap and then a couple of lines later is told to remove the filler cap, but the translation comes out as "release the pressure of the radiator cap, and then remove the oil filler cap".

and in the opposite direction if you have several variant terms as well:
bouchon radiateur
bouchon de radiateur
bouchon de radiateur d'automobile
bouchon radiateur à eau voiture

A missing entry, or an under-specified entry for the translation in either direction, can easily lead to lexical asymmetry.

3) another issue is variability in spelling of the same word/term in a source language (such as spell-checker, spellchecker and spell checker in English). This means that 1, 2 or 3 of the variants might be coded into the English general dictionary with a single (or several) equivalent French output entry(ies). If all of the variants are coded, then it will provide for high accuracy at the lexical level. Yet, if in the opposite direction (French > Eng) there also happens to be spelling variation, and not all of the variant entries are included, then a mismatch at the lexical level is quite possible. Several articles on lexical variability are available at my website.

4) creating a general user dictionary is a very time-consuming and meticulous task. It is important to provide the best coverage of the most frequent terms used, combined with selecting which translation equivalent(s) is/are the best choice(s) to code into the target language field. For example, it is possible to enter the 8 different meanings of the term "valve" into an MT system used in an automotive/heavy-machinery context because the technical writers and translators will be using all of the different meanings of the term. Yet for a general-use MT system, it is important to choose the most appropriate translation that will have the highest coverage of good understanding by readers/users across a wide variety of fields. The "most appropriate translation" is not a magical formula. Various factors can influence that choice, especially if the word/term has 2 or even 3 distinct semantic meanings in the target language. So semantic factors come into play, as does frequency of use. An example of this is that it is less interesting to include a term that appears 25 times on a Google search versus a term that appears 4 million times. If you combine all of these factors which influence a single translation direction, and then have to consider them again in creating the dictionary in the opposite direction, it leads to a complex matrix.

Achieving symmetry of inclusion and content of lexical entries for bi-directional MT dictionaries is a big task.
This is true for pre-packaged general use MT dictionaries (which come standard with the tool), domain/topical specialized dictionaries, and custom user dictionaries for major MT software systems.
A last point to note is that it has not been until very recently that MT software packages have offered features which allow users, as they create a dictionary entry in one direction, to (semi)automatically create the corresponding dictionary entry in the opposite direction.
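To make the idea of lexical asymmetry concrete, here is a minimal Python sketch. It is not how SYSTRAN, Reverso or any real MT package stores its dictionaries: the two toy dictionaries, their specific mappings (including the bare entry "bouchon") and the function names are all assumptions chosen for illustration, reusing the radiator-cap terms from the points above.

```python
def missing_reverse_entries(forward, reverse):
    """Target-language terms that `forward` can produce but `reverse` cannot look up."""
    produced = set(forward.values())
    return sorted(produced - set(reverse))


def round_trip_mismatches(forward, reverse):
    """Source terms that do not come back unchanged after a forward then reverse lookup."""
    mismatches = {}
    for source, target in forward.items():
        back = reverse.get(target)  # None when the reverse entry is missing
        if back != source:
            mismatches[source] = back
    return mismatches


def derive_reverse(forward):
    """Naive (semi)automatic reverse dictionary: the first source term seen wins."""
    reverse = {}
    for source, target in forward.items():
        reverse.setdefault(target, source)
    return reverse


# Hypothetical English > French entries: several variants share one translation.
en_fr = {
    "radiator cap": "bouchon de radiateur",
    "radiator filler cap": "bouchon de radiateur",
    "filler cap": "bouchon",
}
# Hypothetical French > English entries: only one term has been coded.
fr_en = {
    "bouchon de radiateur": "radiator cap",
}

print(missing_reverse_entries(en_fr, fr_en))
print(round_trip_mismatches(en_fr, fr_en))
print(derive_reverse(en_fr))
```

Even the automatically derived reverse dictionary collapses "radiator cap" and "radiator filler cap" into a single French entry, so a human still has to decide which English variant should come back.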

Jeff

21 January, 2006 10:32
Blogger mtpostediting wrote...

18 January, 2006, justin barker wrote:
I see no reason however why a translator couldn't be unbalanced semantically and syntactically. (Not that one can really extract any of these categories.)


See my posting above, on lexical (a)symmetry in MT systems, which provides some points on semantic issues. Semantic variation can be based on the level of precision of the terms that are used versus more general terms which are used to cover more specific terms (for ex: filter to mean both engine oil filter and air filter).

Most MT systems do use semantic classes, but to a lesser or greater extent. I conducted an analysis on this between the Reverso v5 system and the PROMT v6 system (both of which are based on a PROMT kernel). This report is available at:
http://groups.yahoo.com/group/Reverso_users/ (see message 6)
http://groups.yahoo.com/group/PROMT_users/ (see message 7)


Jeff

21 January, 2006 11:04
Blogger mtpostediting wrote...

18 January, 2006, justin barker wrote:
I see no reason however why a translator couldn't be unbalanced semantically and syntactically.


Syntactic asymmetry is very common. This is why backtranslation techniques are usually discouraged when using MT systems. I wrote an article on using backtranslation. See "Getting Started with Machine Translation" at:
http://www.translatorscafe.com/cafe/article50.htm

Jeff

21 January, 2006 11:17
Blogger Justin wrote...

Thank you, Jeff.

23 January, 2006 09:32
Blogger Unknown wrote...

Hello Gentlemen,

Sorry for breaking in so late - this discussion just got indexed by my Google News tracker.

Just wanted to let you know that both Voila (tr.voila.fr) and Orange (traduction.orange.fr) are now using the original Promt translation service, so the landscape is becoming a little bit more diversified - at least in France.

Regards,
Nikolay Vasiliev

28 August, 2007 20:34
Anonymous language translation wrote...

Interesting post. It's true that machine translation is becoming increasingly effective, but it needs to start interpreting idioms and understanding the cultural context of a text before it can truly replace human translation.

26 September, 2009 19:39
Blogger Unknown wrote...

Hi Language Translation:

But it is possible to handle idioms and stylistic expressions with various MT software programs. I do it all the time. The objective is to use the MT software as an assistance tool for the human translator. As for cultural aspects, it is possible to handle localization variants within such tools, with varying levels of usability. I have worked in real translation production projects with 30+ versions of MT software (and 2 brand new ones received recently to start trying out), and have used MT to translate a very wide range of topics, domains and document types.
see: http://www.proz.com/post/1268576#1268576

Many people write in forums that MT should/could/would not work, but those words all clearly indicate to me that those people have never really tried it, or they tried with a free online translator rather than a professional or expert desktop system designed for the purpose.
Would you try to use a 1- or 3-speed bicycle to ride the Tour de France? Of course not; a 27-speed bike is more appropriate.

I always write can/does/makes in my statements about MT, because I do use it and write case studies about my implementations.

Jeff

31 January, 2010 05:21
