Graduation details

[PhD] - Time and Space Efficient Data Structures for Supporting Machine Translation Tasks

Sep 2012 - Dec 2017

Abstract: The volume of digital natural language text available today is huge and has been growing at an exponential rate. All this information can be easily accessed by people of many nationalities and cultures. This has led to the development of new and innovative techniques and tools for processing and indexing these texts, in research fields such as Machine Translation, Natural Language Processing and Cross-Language Information Retrieval. Over the years, much important work has relied on efficient data structures, such as suffix arrays, for fast pattern matching and for computing statistics. However, these data structures require a considerable amount of space, around four times the text size, which is a problem given the amount of bilingual text available in so many languages. This thesis proposal introduces a two-layer bilingual framework based on compact data structures for indexing parallel texts, translation memories and bilingual lexica, together with their alignments, for pairs of different languages. Besides a word-based suffix array implementation, this thesis proposal presents a solution based on two byte-codes wavelet trees, one for each text, and bitmaps to represent the alignment. Additionally, it introduces a skip-based bilingual search procedure that reduces the search response time of the framework for operations over pairs of words, multi-word phrases or discontiguous phrases. For indexing and querying aligned parallel corpora, the bilingual framework uses around 50% of the size of the alignment-annotated corpora, against 160% for the non-compressed approach. In terms of search response time, the compressed approach is, as expected, slower than the one based on suffix arrays. The skip-based bilingual search procedure improves on the response time of the original bilingual search algorithm by 1.6x to 2.3x on average.
With these space requirements, the framework can hold huge amounts of data in main memory, avoiding considerably slower disk accesses, and can support tasks such as translation, text alignment, word-sense disambiguation and context analysis.

Start Date: 27 Sep 2012

End Date: 18 Dec 2017

Post-Graduation by: Jorge Costa

Post-Graduation Supervisor(s): Gabriel Pereira Lopes, Luís Russo

Post-Graduation Jury(s): Pedro Medeiros, Margarida Mamede, Nieves R. Brisaboa, Maria Andrea Rodríguez-Tastets, Miguel Ángel Martínez Prieto