Translation: Systran or Reverso?
Linguists consider it a matter of faith to poke fun at machine translations. It is true that they often provide us with a veritable anthology of badly-constructed sentences and meaningless phrases that can border on the surreal. But the earliest research in machine translation dates from the beginning of the 1950s: more than half a century of effort has not been enough for us to succeed in cracking the code. A sign of the inherent difficulties of language, perhaps? In the same period of time, we have managed to decipher the human genome (the discovery of the double helix structure of DNA in 1953 dates from around the same time as the early days of machine translation)...
Still, progress is being made – too slowly for my liking, of course, but we mustn’t be unfair. If machine translation cannot compete with a human translator (even a bad one!), that doesn’t necessarily mean that it is completely without interest. Here’s a little experiment that I give to my students each year in my introductory course to Automatic Language Processing. Let’s take one of the day’s top stories in a Greek newspaper, Kathimerini:
Δύο νέα κρούσματα στην Τουρκία του θανατηφόρου ιού της γρίπης των πτηνών
Ο Παγκόσμιος Οργανισμός Υγείας ανακοίνωσε σήμερα στη Γενεύη ότι δυο παιδιά που νοσηλεύονται στην Τουρκία έχουν προσβληθεί από το θανατηφόρο στέλεχος Η5Ν1 του ιού της γρίπης των πτηνών.
Εκπρόσωπος του Οργανισμού δήλωσε ότι τα παιδιά, ηλικίας 5 και 8 ετών, προέρχονται από την ίδια περιοχή με τα τρία αδέλφια που πέθαναν από τη γρίπη των πτηνών αυτήν την εβδομάδα.
Σημειώνεται πως 32 άτομα νοσηλεύονται σε νοσοκομείο της πόλης Βαν με ύποπτα συμπτώματα, ενώ τουλάχιστον πέντε περιοχές της ανατολικής Τουρκίας έχουν τεθεί σε καραντίνα. [original]
Probably doesn’t mean a lot to you, does it? I always deliberately choose a language that few people are likely to know. Greek is perfect because we can’t even hazard a guess at what the text is about from the form of the words (whereas we can often make out the gist of English, Spanish or German even if we don’t speak the language). Chinese or Japanese would also make good candidates!
Let’s compare this with the version translated by Babelfish:
Two new cases in Turkey of leathal virus of flu of birds
The World Organism of Health announced today in Geneva that two children that nosiley'ontaj in Turkey they have been offended by leathal executive I5N1 of virus of flu of birds.
Representative of Organism declared that the children, age 5 and 8 years, emanate from the same region with the three brothers that died from the flu of birds this week.
It is marked that 32 individuals nosiley'ontaj in hospital of city Van with suspect symptoms, while at least five regions of Eastern Turkey have been placed in quarantine.
This translation is a perfect example of the state of the art in the field. We can understand the general subject matter (bird flu in Turkey), and we can even list the main facts: two children aged 5 and 8 have been infected with bird flu in Turkey, three siblings from the same region died of it this week, 32 people have been hospitalised with suspicious symptoms, at least five regions are under quarantine, etc. Some of the errors are stupid: H5N1 is translated as I5N1, and νοσηλεύονται (are hospitalised, are being treated) is missing from the dictionary. Things could easily be improved.
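To show how easily the I5N1 slip could be avoided, here is a minimal Python sketch. It rests on an assumption of mine: that the Greek article writes the strain code with the Greek capitals eta and nu, which look exactly like Latin H and N, and that the system then transliterated eta as "I". I have no knowledge of how Systran actually works internally. The function name `normalise_codes` and the look-alike table are hypothetical, and the table is an illustrative subset only.

```python
import re

# Greek capitals that look like Latin capitals (illustrative subset, an assumption).
GREEK_TO_LATIN_LOOKALIKES = str.maketrans({
    "\u0391": "A",  # Alpha
    "\u0392": "B",  # Beta
    "\u0395": "E",  # Epsilon
    "\u0396": "Z",  # Zeta
    "\u0397": "H",  # Eta
    "\u0399": "I",  # Iota
    "\u039a": "K",  # Kappa
    "\u039c": "M",  # Mu
    "\u039d": "N",  # Nu
    "\u039f": "O",  # Omicron
    "\u03a1": "P",  # Rho
    "\u03a4": "T",  # Tau
    "\u03a5": "Y",  # Upsilon
    "\u03a7": "X",  # Chi
})

# A whole token made only of capitals (Latin or Greek) and digits,
# containing at least one digit and at least one capital: a "code" such as a strain name.
CODE_TOKEN = re.compile(
    r"\b(?=\w*\d)(?=\w*[A-Z\u0391-\u03A9])[A-Z\u0391-\u03A90-9]+\b"
)


def normalise_codes(text: str) -> str:
    """Rewrite Greek look-alike capitals as Latin ones, but only inside code-like tokens."""
    return CODE_TOKEN.sub(
        lambda match: match.group(0).translate(GREEK_TO_LATIN_LOOKALIKES), text
    )


if __name__ == "__main__":
    # Strain code written with Greek Eta (U+0397) and Nu (U+039D).
    print(normalise_codes("\u03975\u039d1"))  # H5N1
```

Ordinary Greek words and plain numbers are left alone, because the pattern only fires on tokens that mix capitals with digits. Run before translation, a step like this would hand the system a code it has no reason to transliterate.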
On no account should such translations be used as final documents, and I’m always stunned when students (or colleagues!) proudly announce that they have had their abstract machine translated for a conference! But machine translation has reached the point where it can now legitimately be used as a tool for deciphering a text, a way of quickly getting to know the subject matter and general content of pages in foreign languages, for those situations where paying a translator would be inconceivable. It is used, for instance, in economic monitoring, and can prove useful for ordinary web surfers as well: although by far the majority of documents on the web are written in English, fewer than 30% of web surfers are English-speakers (according to a study carried out by Byte Level), and this proportion is falling all the time.
It comes as no surprise, then, that most search engines offer the option of translating any pages returned. But with such a considerable potential market, it is quite surprising to see that the choice on offer is so limited: Google and Yahoo both use the same technology, the Systran system, which is also behind Babelfish (Altavista). At first, French search engine Voila used Reverso by Softissimo, before finally opting for Systran as well. Portals like AOL and Wanadoo also offer Systran. Indeed, Systran has Internet operators to thank for the lion’s share of its turnover.
In the midst of such widespread systrannisation, Ask Jeeves recently made the surprising announcement that it is to partner with Reverso [via DSI (fr)], which is also available on the search engine’s French beta version.
Is this a bad choice? In order to find out, we asked 58 students from the first year of our degree course to look into the question. The project consisted of having each student translate a text of their choice, at least 500 words long, from their second language into their mother tongue (so that they could judge the quality of the end result properly), using both Reverso and Systran (on the Babelfish site). Each student then had to deliver a detailed report on the errors and their probable causes (word missing from dictionary, etc.); don’t worry, I’ll spare you the details. The final question asked each student to choose which of the two, Reverso or Systran, provided the more readable translation.
The results are quite categorical:
For all the languages studied, the choice was clear - Reverso.
So... could this be a smart move on the part of Ask Jeeves? In any case, Systran, which has fallen out with its traditional “cash cow”, the European Commission [see Le Monde, Systran (fr)], will have to buck its ideas up if it is to survive in the pitiless world of the Internet operators.
Thanks to Estelle for going through the study.