Project details

PATRAS - PArallelism for Machine Learning of TRAnslationS

May 2005 - Nov 2007

Within the projects DIXIT (contract PRAXIS 2/2.1/TIT/1670/95) and TRADAUT-PT (European contract MLIS-4005 TRADAUT-PT 26192), our team developed an innovative, language-independent procedure for aligning bitexts (parallel texts, i.e., texts that are translations of each other or of a common source text). Using the resulting alignments and an indexing and retrieval engine acquired for the TRADAUT-PT project, we built a translation extractor, a monolingual and a bilingual concordancer, and an MT validator. However, because the source code of that engine is not available to us, the translation extraction process is computationally heavy, and it is difficult to adapt these tools to deliver services within a reasonable response time. Worse, the data structures already built cannot easily be reused to exploit the available information for any volume of parallel text (terabytes of data) and for any number of languages.

We therefore aim to implement a new indexing engine, based on suitable data structures (suffix arrays), suitable algorithms, and a suitable computational architecture (grid computing), in order to solve the problems identified above. With this project, we aim to deliver a Global Computing infrastructure for translation services, using an innovative translation technique (Translation as Information Retrieval) that became apparent during the preparation of the PATRAS project. We aim to apply our technique to 6 languages, including Portuguese, Spanish, French, German, and Czech, from 3 different families (Latin, Germanic and Slavic), for a total of 15 language pairs.

Our goal is to build a national, European and worldwide service provider for the translation needs of all agents working in the Translation and Cross-Language Information Retrieval areas. We have the knowledge and expertise in this field; what remains is to make our services available by effectively tackling response speed, very large numbers of users, and access to terabytes of information for as many languages as possible. The larger the quantity of aligned parallel text and the number of languages covered, the higher the probability of direct translation success, even for languages for which there is no parallel corpus.

Project type: PN

Project reference:

Coordinated by:

CITI - FCT/UNL - Centro de Informática e Tecnologias de Informação, FCT/UNL

Funding entities:

FCT-MCTES - Fundação para a Ciência e a Tecnologia (MEC)

Total Funding Amount: 80

Local Funding Amount: 80

Start Date: 1 May 2005

End Date: 30 Nov 2007

Participations

Joaquim Ferreira da Silva [Researcher], Gabriel Pereira Lopes [Coordinator]

Partnerships

CITI - FCT/UNL - Centro de Informática e Tecnologias de Informação, FCT/UNL

Url: