ISTRION - Improving Phrase-Based Statistical Machine TRanslation through supervisION
Mar 2011 - Feb 2013
In earlier projects (DIXIT, contract PRAXIS 2/2.1/TIT/1670/95; TRADAUT-PT, European contract MLIS-4005 TRADAUT-PT 26192; ASTROLABIUM, European contract MOBI-CT-2003-003344; and PATRAS, contract POSC/PLP/61520/2004), and in the ongoing project VIP-ACCESS (PTDC/PLP/72142/2006), we developed innovative language-independent text mining procedures that operate on raw text: for aligning parallel texts (two texts are parallel if they are translations of each other, and the alignment procedure breaks them into segments that should remain translations of each other) (Ribeiro et al, 2000c; Gomes et al, 2009); for extracting word and phrase translations (Ribeiro et al, 2000b; Aires et al, 2009); for extracting multi-word terms (Silva et al, 1999; Aires et al, 2008); for clustering documents (Silva et al, 2001a); for identifying the language in which a document is written (da Silva et al, 2006); for discovering phrases with similar meanings, on the basis that they occur in similar local lexical contexts (Gamallo et al, 2005; 2008); and for identifying key terms in documents (Silva et al, 2009).
As a consequence of this wide-ranging activity, over the last three years we made translation one of our leading objectives, and followed an approach to Phrase-Based Statistical Machine Translation (SMT) that differs from the state of the art in this research area (Och and Ney, 2004). As a first difference, we required that word and phrase translations extracted from aligned parallel corpora be validated, thus introducing a first level of supervision. Iterating over realignment (reusing correctly acquired term translations), extraction of unknown term translations, and validation improved alignment precision from a maximum of 75.5% (when no translation knowledge was used, for the Portuguese (PT)-English (EN) pair) (Darriba et al, 2005) to 84.5% (Gomes et al, 2009) at an early stage of this iteration. As a consequence, the precision of extracted unknown term translations improved, even for very low occurrence frequencies (Aires et al, 2009). Recently, we achieved 71% on the BLEU translation quality measure in the PT>EN direction and 65% in the EN>PT direction. These results improve by 10 BLEU points on those obtained by Koehn et al (2009) for the same language pairs, even though we did not use a phrase reordering model in our experiments. As a second difference, Gomes et al (2009) improved the robust alignment method we had previously created (Ribeiro et al, 2000c). As a third difference, which is a direct consequence of periodically supervising term translation extraction, phrase alignment imprecision is reduced, which enables us to take alignment for granted rather than treat it as a hidden variable, as is done in word- and phrase-based SMT.
Within this framework, in this project:
1. We will work on translation among four languages: English (EN), Spanish (ES), Portuguese (PT), and Hindi (HI), that is, 6 language pairs and 12 translation directions, chosen for their diversity and economic potential. Our own approach to translation will be completed with a reordering model that may incorporate procedures from hierarchical Phrase-Based SMT (Chiang, 2007) for handling long-distance reordering phenomena.
2. Pivoting will be a main concern: we will use EN as a pivot language to enable translation between HI and both PT and ES, as there is an immense volume of parallel corpora for the HI-EN language pair that does not overlap with the huge parallel corpora available for EN, ES, and PT. Pivoting will also be used to enable quicker and more precise term translation extraction, by taking the existing PT-EN lexicon and its entries as guides. The positive and negative knowledge available from previously validated term translations will be used to train an existing classifier (probably an SVM), which will then classify new entries as good or bad, helping to reduce validation effort.
3. Alignment and translation technologies will be improved by bringing in results from the extraction of monolingual phrases, phrase translation, monolingual and bilingual paraphrase identification, document classification, and language identification.
4. In the PATRAS project, the text mining and translation technology we developed used Suffix Arrays. By bringing in a specialist on Suffix Trees (a data structure closely related to Suffix Arrays), we will work to enhance the use of both data structures in translation.
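As an illustration of the pivoting idea in item 2, the following minimal Python sketch composes an HI-EN lexicon with an EN-PT lexicon to propose HI-PT candidate translations. It is only a sketch under stated assumptions, not the project's procedure: the function name `pivot_lexicon`, the dictionary-of-sets representation, and the toy entries (with transliterated Hindi) are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def pivot_lexicon(src_to_pivot, pivot_to_tgt):
    """Compose two bilingual lexicons through a shared pivot language.

    Both arguments map a term to a set of candidate translations.
    The result maps each source term to every target term reachable
    through at least one pivot term.
    """
    candidates = defaultdict(set)
    for src_term, pivot_terms in src_to_pivot.items():
        for pivot_term in pivot_terms:
            # Pivot terms missing from the second lexicon contribute nothing.
            for tgt_term in pivot_to_tgt.get(pivot_term, ()):
                candidates[src_term].add(tgt_term)
    return dict(candidates)


# Toy, hypothetical entries (Hindi transliterated), for illustration only.
hi_en = {"pani": {"water"}, "ghar": {"house", "home"}}
en_pt = {"water": {"água"}, "house": {"casa"}, "home": {"casa", "lar"}}

hi_pt = pivot_lexicon(hi_en, en_pt)
for hi_term, pt_terms in sorted(hi_pt.items()):
    print(hi_term, sorted(pt_terms))
```

Candidates produced this way may be wrong when the pivot term is ambiguous, so in the setting described above they would still go through validation, possibly pre-filtered by the classifier, before being reused.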
It is our ultimate goal to show that a machine whose knowledge acquisition and use are adequately supervised by humans may improve its own competence and outperform current translation systems, getting progressively closer to human competence.