[PhD] - Translation Alignment and Extraction Within a Lexica-Centered Iterative Workflow
Oct 2009 - Dec 2017
This thesis addresses two closely related problems. The first, translation alignment, consists of identifying bilingual document pairs that are translations of each other within multilingual document collections (document alignment); identifying sentences, titles, etc., that are translations of each other within bilingual document pairs (sentence alignment); and identifying corresponding word and phrase translations within bilingual sentence pairs (phrase alignment). The second is the extraction of bilingual pairs of equivalent single-word and multi-word expressions, which we call translation equivalents (TEs), from sentence- and phrase-aligned parallel corpora.
While these same problems have been investigated by other authors, their focus has been on fully unsupervised methods based mostly or exclusively on parallel corpora. Bilingual lexica, which are basically lists of TEs, have not been considered or given enough importance as resources in the treatment of these problems. Human validation of TEs, which consists of manually classifying TEs as correct or incorrect translations, has also not been considered in the context of alignment and extraction. Validation strengthens the importance of infrequent TEs (most of the entries of a validated lexicon) that otherwise would be statistically unimportant.
The main goal of this thesis is to revisit the alignment and extraction problems in the context of a lexica-centered iterative workflow that includes human validation. Therefore, the methods proposed in this thesis were designed to take advantage of knowledge accumulated in human-validated bilingual lexica and translation tables obtained by unsupervised methods. Phrase-level alignment is a stepping stone for several applications, including the extraction of new TEs, the creation of statistical machine translation systems, and the creation of bilingual concordances. For phrase-level alignment, the higher accuracy of human-validated bilingual lexica is thus crucial for achieving higher-quality results in these downstream applications.
There are two main conceptual contributions. The first is the coverage maximization approach to alignment, which makes direct use of the information contained in a lexicon, or in translation tables when the lexicon is small or does not exist. The second is the introduction of translation patterns, which combine novel and established ideas and enable precise and productive extraction of TEs. As material contributions, the alignment and extraction methods proposed in this thesis have produced source materials for three lines of research, in the context of three PhD theses (two of them already defended), all supervised by my own advisor. The topics of these lines of research are statistical machine translation, algorithms and data structures for indexing and querying phrase-aligned parallel corpora, and bilingual lexica classification and generation. Four publications have resulted directly from the work presented in this thesis and twelve from the collaborative lines of research.